Instructions

Students are encouraged to work together on homework. However, sharing, copying, or providing any part of a homework solution or code is an infraction of the University’s rules on Academic Integrity. Any violation will be punished as severely as possible. Final submissions must be uploaded to Gradescope. No email or hard copy will be accepted. For late submission policy and grading rubrics, please refer to the course website.

Question 1: Image Pixel Smoothing [25 pts]

Load an image from the ElemStatLearn package. This is an archived package on CRAN. If you do not have this package, download and install version 2015.6.26.2 from here, or use the install_github() function from the devtools package. Below is a plot of the first image in the zip.train dataset, which is a digit 6. The resolution of this image is \(16 \times 16\). We “blow it up” to \(48 \times 48\) by replicating each pixel into a \(3 \times 3\) block. The plots of the original and enlarged images below look exactly the same; only the dimensions differ.

  # Handwritten Digit Recognition Data
  library(ElemStatLearn)

  # plot two images
  par(mfrow=c(1,2), mar=c(0,0,1,0))

  # look at the first sample
  img16 <- zip2image(zip.train, 1)
## [1] "digit  6  taken"
  image(img16, col=gray(256:0/256), zlim=c(0,1), 
        xlab="", ylab="", axes = FALSE)
  
  # change the resolution of this image to 48 x 48
  img48 <- img16[rep(1:16, each = 3), rep(1:16, each = 3)]

  # plot the enlarged image
  image(img48, col = gray(256:0/256), zlim=c(0,1), 
        xlab="", ylab="", axes=FALSE)

Although the second image is larger, it is still very pixelated. Let’s consider a way to make it better. Treat the two dimensions of the \(48 \times 48\) image as two covariates, and apply a (2-dimensional) RKHS smoothing method to obtain smoothed pixel gray scale values and make the image look better. For this question, you should
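As a rough illustration only (not necessarily the intended solution), the sketch below fits a 2-dimensional kernel ridge regression on the pixel coordinates of img48 with a Gaussian kernel; the bandwidth sigma and penalty lambda are arbitrary placeholder values that you would need to tune.

  # sketch: 2-d kernel ridge regression on pixel coordinates (assumed approach)
  grid <- expand.grid(x1 = 1:48, x2 = 1:48)   # pixel coordinates as covariates
  y <- as.vector(img48)                       # gray scale values as the response

  sigma <- 2                                  # placeholder kernel bandwidth
  lambda <- 0.01                              # placeholder penalty
  D2 <- as.matrix(dist(grid))^2               # squared Euclidean distances
  K <- exp(-D2 / (2 * sigma^2))               # Gaussian kernel matrix

  # kernel ridge regression solution and fitted (smoothed) values
  alpha <- solve(K + lambda * diag(nrow(K)), y)
  img48_smooth <- matrix(K %*% alpha, 48, 48)

  # plot the smoothed image (values clipped back into [0, 1])
  image(pmin(pmax(img48_smooth, 0), 1), col = gray(256:0/256), zlim = c(0, 1),
        xlab = "", ylab = "", axes = FALSE)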

Question 2: Positive-Definiteness of Kernels [40 pts]

  1. Let \(\cal X\) be a set and \(\cal F\) be a Hilbert space. \(\Phi\) is a map from \(\cal X\) to \(\cal F\). If we define a kernel function \(k(\cdot, \cdot)\) as \[ k(x, x') = \langle \Phi(x), \Phi(x') \rangle_{\cal F}, \quad \forall x, x' \in \cal X, \] show that \(k\) is a positive definite kernel (Hint: use the definition of positive definiteness, restated after this list).

  2. Suppose \(k_1\) and \(k_2\) are two positive definite kernels on \(\cal X\). Show that the kernel \(k = k_1 + k_2\) is also a positive definite kernel.

  3. Consider the kernel function \[ k(x, x') = 1\{ | x - x' | < \sigma \}. \] Is this a positive definite kernel? If yes, prove it. If no, give a counterexample.
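For reference, recall the standard definition used in the hints above: a symmetric function \(k: {\cal X} \times {\cal X} \rightarrow \mathbb{R}\) is positive definite if \[ \sum_{i=1}^n \sum_{j=1}^n c_i c_j \, k(x_i, x_j) \geq 0 \] for every \(n \in \mathbb{N}\), every \(x_1, \dots, x_n \in {\cal X}\), and every \(c_1, \dots, c_n \in \mathbb{R}\).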

Question 3: Uniqueness of Kernel Functions [10 pts]

Show that if a reproducing kernel \(k(\cdot, \cdot)\) exists for a Hilbert space \(\cal H \subseteq \mathbb{R}^{\cal X}\), then it is unique. Hint: you may assume that there are two different reproducing kernels \(k_1\) and \(k_2\) for \(\cal H\), and then show that the Hilbert norm of their difference, \(\|k_1(x, \cdot) - k_2(x, \cdot)\|_{\cal H}\), is zero for any \(x \in \cal X\). To do that, expand the squared norm and use the properties of a Hilbert space and the reproducing property.

Question 4: Kernel Logistic Regression [25 pts]

By the same approach as kernel ridge regression, we can also use the kernel method to perform logistic regression. For a logistic regression with a linear predictor, the \(-2\) log-likelihood function is given by

\[ -2 \ell(\beta) = -2 \sum_{i=1}^n \left[ y_i x_i^T \beta - \log(1 + \exp(x_i^T \beta)) \right], \] where the linear predictor \(x_i^T \beta\) models the effect of the covariates. This linear predictor can be replaced by \(f(x_i)\) for some function \(f \in \cal H\); all we need to do is specify a kernel function and add a penalty term \(\lambda \|f\|_{\cal H}^2\). By the representer theorem, the minimizer takes the form \(\widehat f(\cdot) = \sum_{j=1}^n \alpha_j k(x_j, \cdot)\), so that \(f(x_i) = K_i^T \alpha\) and \(\|f\|_{\cal H}^2 = \alpha^T K \alpha\). The optimization problem therefore becomes

\[ \widehat\alpha = \arg\min_{\alpha \in \mathbb{R}^n} \left\{ -2 \sum_{i=1}^n \left[ y_i K_i^T \alpha - \log(1 + \exp(K_i^T \alpha)) \right] + \lambda \alpha^T K \alpha \right\}, \]

where \(K_i\) is the \(i\)-th row of the kernel matrix \(K\). No analytic solution exists, but you can define this loss function and pass it to the optim() function in R to obtain the solution (a rough sketch of this is given after the data block below). Use the following training and testing data from the handwritten digit dataset, restricted to digits 5 and 6. For this question, you should:

  # training data: the first 50 observations each of digits 5 and 6
  digits5 <- zip.train[zip.train[,1] == 5, ]
  digits5 <- digits5[1:50, ]
  digits6 <- zip.train[zip.train[,1] == 6, ]
  digits6 <- digits6[1:50, ]
  
  train56 <- rbind(digits5, digits6)
  
  # get the testing data 
  digits5_test <- zip.test[zip.test[,1] == 5, ]
  digits5_test <- digits5_test[1:50, ]
  digits6_test <- zip.test[zip.test[,1] == 6, ]
  digits6_test <- digits6_test[1:50, ]
  
  test56 <- rbind(digits5_test, digits6_test)
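As a rough sketch only (not necessarily the intended solution), the code below defines the penalized \(-2\) log-likelihood above and minimizes it with optim(), using a Gaussian kernel; the bandwidth sigma and penalty lambda are placeholder values that you would need to choose or tune.

  # sketch: kernel logistic regression via optim() (assumed approach)
  x_train <- train56[, -1]
  y_train <- as.numeric(train56[, 1] == 6)    # code digit 6 as 1, digit 5 as 0

  sigma <- 4                                  # placeholder kernel bandwidth
  lambda <- 1                                 # placeholder penalty
  K <- exp(-as.matrix(dist(x_train))^2 / (2 * sigma^2))

  # penalized -2 log-likelihood as a function of alpha
  kernel_logistic_loss <- function(alpha, K, y, lambda) {
    eta <- as.vector(K %*% alpha)
    -2 * sum(y * eta - log(1 + exp(eta))) + lambda * sum(alpha * (K %*% alpha))
  }

  fit <- optim(par = rep(0, nrow(K)), fn = kernel_logistic_loss,
               K = K, y = y_train, lambda = lambda, method = "BFGS")

  # testing data: cross kernel between test and training observations
  x_test <- test56[, -1]
  y_test <- as.numeric(test56[, 1] == 6)
  D2 <- outer(rowSums(x_test^2), rowSums(x_train^2), "+") - 2 * x_test %*% t(x_train)
  K_test <- exp(-D2 / (2 * sigma^2))

  # predicted probabilities of digit 6 and the test classification accuracy
  p_test <- 1 / (1 + exp(-as.vector(K_test %*% fit$par)))
  mean((p_test > 0.5) == y_test)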