Instructions

Students are encouraged to work together on homework. However, sharing, copying, or providing any part of a homework solution or code is an infraction of the University’s rules on Academic Integrity. Any violation will be punished as severely as possible. Final submissions must be uploaded to Gradescope. No email or hard copy will be accepted. For late submission policy and grading rubrics, please refer to the course website.

Question 1: Image Pixel Smoothing [25 pts]

Load an image from the ElemStatLearn package. This is an archived package on CRAN. If you do not have this package, download and install version 2015.6.26.2 from here, or use the install_github() function from the devtools package. Below is a plot of the first image in the zip.train dataset, which is a digit 6. The resolution of this image is \(16 \times 16\). We “blow it up” to \(48 \times 48\) by replicating each pixel into a \(3 \times 3\) block. The plots of the original and enlarged images below look exactly the same; only the dimensions differ.

  # Handwritten Digit Recognition Data
  library(ElemStatLearn)

  # plot two images
  par(mfrow=c(1,2), mar=c(0,0,1,0))

  # look at the first sample
  img16 <- zip2image(zip.train, 1)
## [1] "digit  6  taken"
  image(img16, col=gray(256:0/256), zlim=c(0,1), 
        xlab="", ylab="", axes = FALSE)
  
  # change the resolution of this image to 48 x 48
  img48 <- img16[rep(1:16, each = 3), rep(1:16, each = 3)]

  # plot the enlarged image
  image(img48, col = gray(256:0/256), zlim=c(0,1), 
        xlab="", ylab="", axes=FALSE)

Although the second image is larger, it is still very pixelated. Let’s consider a way to make it better. Treat the two dimensions of the \(48 \times 48\) image as two covariates, and apply a (2-dimensional) RKHS smoothing method to obtain smoothed pixel gray scale values and make the image look better. For this question, you should
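As a rough illustration only (not necessarily the intended solution), the sketch below fits a 2-dimensional kernel ridge regression on the pixel coordinates of img48 with a Gaussian kernel; the bandwidth sigma and penalty lambda are arbitrary placeholder values that you would need to tune.

  # sketch: 2-d kernel ridge regression on pixel coordinates (assumed approach)
  grid <- expand.grid(x1 = 1:48, x2 = 1:48)   # pixel coordinates as covariates
  y <- as.vector(img48)                       # gray scale values as the response

  sigma <- 2                                  # placeholder kernel bandwidth
  lambda <- 0.01                              # placeholder penalty
  D2 <- as.matrix(dist(grid))^2               # squared Euclidean distances
  K <- exp(-D2 / (2 * sigma^2))               # Gaussian kernel matrix

  # kernel ridge regression solution and fitted (smoothed) values
  alpha <- solve(K + lambda * diag(nrow(K)), y)
  img48_smooth <- matrix(K %*% alpha, 48, 48)

  # plot the smoothed image (values clipped back into [0, 1])
  image(pmin(pmax(img48_smooth, 0), 1), col = gray(256:0/256), zlim = c(0, 1),
        xlab = "", ylab = "", axes = FALSE)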

Question 2: Positive-Definiteness of Kernels [40 pts]

  1. Let \(\cal X\) be a set and \(\cal F\) be a Hilbert space. \(\Phi\) is a map from \(\cal X\) to \(\cal F\). If we define a kernel function \(k(\cdot, \cdot)\) as \[ k(x, x') = \langle \Phi(x), \Phi(x') \rangle_{\cal F}, \quad \forall x, x' \in \cal X, \] show that \(k\) is a positive definite kernel (Hint: use the definition of positive definiteness, restated after this list).

  2. Suppose \(k_1\) and \(k_2\) are two positive definite kernels on \(\cal X\). Show that the kernel \(k = k_1 + k_2\) is also a positive definite kernel.

  3. Consider the kernel function \[ k(x, x') = 1\{ | x - x' | < \sigma \}. \] Is this a positive definite kernel? If yes, prove it. If no, give a counterexample.
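For reference, recall the standard definition used in the hints above: a symmetric function \(k: {\cal X} \times {\cal X} \rightarrow \mathbb{R}\) is positive definite if \[ \sum_{i=1}^n \sum_{j=1}^n c_i c_j \, k(x_i, x_j) \geq 0 \] for every \(n \in \mathbb{N}\), every \(x_1, \dots, x_n \in {\cal X}\), and every \(c_1, \dots, c_n \in \mathbb{R}\).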

Question 3: Uniqueness of Kernel Functions [10 pts]

Show that if a reproducing kernel \(k(\cdot, \cdot)\) exists for a Hilbert space \(\cal H \subseteq \mathbb{R}^{\cal X}\), then it is unique. Hint: you may assume that there are two different reproducing kernels \(k_1\) and \(k_2\) for \(\cal H\), and then show that the Hilbert norm of their difference, \(\|k_1(x, \cdot) - k_2(x, \cdot)\|_{\cal H}\), is zero for any \(x \in \cal X\). To do that, expand the squared norm and use the properties of a Hilbert space and the reproducing property.

Question 4: Kernel Logistic Regression [25 pts]

By the same approach as kernel ridge regression, we can also use the kernel method to perform logistic regression. For a logistic regression with a linear predictor, the \(-2\) log-likelihood function is given by

\[ -2 \ell(\beta) = -2 \sum_{i=1}^n \left[ y_i x_i^T \beta - \log(1 + \exp(x_i^T \beta)) \right], \] where the linear predictor \(x_i^T \beta\) models the effect of the covariates. This linear predictor can be replaced by \(f(x_i)\) for some function \(f \in \cal H\); all we need to do is specify a kernel function and add a penalty term \(\lambda \|f\|_{\cal H}^2\). By the representer theorem, the minimizer takes the form \(\widehat f(\cdot) = \sum_{j=1}^n \alpha_j k(x_j, \cdot)\), so that \(f(x_i) = K_i^T \alpha\) and \(\|f\|_{\cal H}^2 = \alpha^T K \alpha\). The optimization problem therefore becomes

\[ \widehat\alpha = \arg\min_{\alpha \in \mathbb{R}^n} \left\{ -2 \sum_{i=1}^n \left[ y_i K_i^T \alpha - \log(1 + \exp(K_i^T \alpha)) \right] + \lambda \alpha^T K \alpha \right\}, \]

where \(K_i\) is the \(i\)-th row of the kernel matrix \(K\). No analytic solution exists, but you can define this loss function and pass it to the optim() function in R to obtain the solution (a rough sketch of this is given after the data block below). Use the following training and testing data from the handwritten digit dataset, restricted to digits 5 and 6. For this question, you should:

  # training data: the first 50 observations each of digits 5 and 6
  digits5 <- zip.train[zip.train[,1] == 5, ]
  digits5 <- digits5[1:50, ]
  digits6 <- zip.train[zip.train[,1] == 6, ]
  digits6 <- digits6[1:50, ]
  
  train56 <- rbind(digits5, digits6)
  
  # get the testing data 
  digits5_test <- zip.test[zip.test[,1] == 5, ]
  digits5_test <- digits5_test[1:50, ]
  digits6_test <- zip.test[zip.test[,1] == 6, ]
  digits6_test <- digits6_test[1:50, ]
  
  test56 <- rbind(digits5_test, digits6_test)
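As a rough sketch only (not necessarily the intended solution), the code below defines the penalized \(-2\) log-likelihood above and minimizes it with optim(), using a Gaussian kernel; the bandwidth sigma and penalty lambda are placeholder values that you would need to choose or tune.

  # sketch: kernel logistic regression via optim() (assumed approach)
  x_train <- train56[, -1]
  y_train <- as.numeric(train56[, 1] == 6)    # code digit 6 as 1, digit 5 as 0

  sigma <- 4                                  # placeholder kernel bandwidth
  lambda <- 1                                 # placeholder penalty
  K <- exp(-as.matrix(dist(x_train))^2 / (2 * sigma^2))

  # penalized -2 log-likelihood as a function of alpha
  kernel_logistic_loss <- function(alpha, K, y, lambda) {
    eta <- as.vector(K %*% alpha)
    -2 * sum(y * eta - log(1 + exp(eta))) + lambda * sum(alpha * (K %*% alpha))
  }

  fit <- optim(par = rep(0, nrow(K)), fn = kernel_logistic_loss,
               K = K, y = y_train, lambda = lambda, method = "BFGS")

  # testing data: cross kernel between test and training observations
  x_test <- test56[, -1]
  y_test <- as.numeric(test56[, 1] == 6)
  D2 <- outer(rowSums(x_test^2), rowSums(x_train^2), "+") - 2 * x_test %*% t(x_train)
  K_test <- exp(-D2 / (2 * sigma^2))

  # predicted probabilities of digit 6 and the test classification accuracy
  p_test <- 1 / (1 + exp(-as.vector(K_test %*% fit$par)))
  mean((p_test > 0.5) == y_test)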