Instructions

Students are encouraged to work together on homework. However, sharing, copying, or providing any part of a homework solution or code is an infraction of the University’s rules on Academic Integrity. Any violation will be punished as severely as possible. Final submissions must be uploaded to Gradescope. No email or hard copy will be accepted. For late submission policy and grading rubrics, please refer to the course website.

Question 1: Image Pixel Smoothing [25 pts]

Load an image from the ElemStatLearn package. This is an archived package on CRAN. If you do not have this package, download and install version 2015.6.26.2 from here, or use the install_github() function from the devtools package. Here is a plot of the first image in the zip.train dataset, which is a digit 6. The resolution of this image is \(16 \times 16\). We then “blow it up” to \(48 \times 48\) by replicating each pixel into a \(3 \times 3\) block. Here is a plot of the original and enlarged images; they look exactly the same, just with different dimensions.

  # Handwritten Digit Recognition Data
  library(ElemStatLearn)
  # plot two images
  par(mfrow=c(1,2), mar=c(0,0,1,0))
  # look at the first sample
  img16 <- zip2image(zip.train, 1)
## [1] "digit  6  taken"
  image(img16, col=gray(256:0/256), zlim=c(0,1), xlab="", ylab="", axes = FALSE)
  # change the resolution of this image to 48 x 48
  img48 <- img16[rep(1:16, each = 3), rep(1:16, each = 3)]
  # plot the enlarged image
  image(img48, col = gray(256:0/256), zlim=c(0,1), xlab="", ylab="", axes=FALSE)

Although the second image is larger, it is still very pixelated. Let’s consider a way to make it better. Treat the two dimensions of the \(48 \times 48\) image as two covariates, and apply a (two-dimensional) smoothing method with RKHS to obtain smoothed grayscale pixel values and make the image look better. For this question, you should

- Use the Matern kernel (google this yourself) with smoothness parameter \(\nu = 1.5\), and pick and experiment with your own length-scale parameter \(\rho\).
- Use the kernel ridge regression (KRR) method to fit the pixel values. You need to write the code yourself instead of using existing KRR packages.
- This involves inverting a large matrix. You may want to consider calculating the kernel matrix in a more efficient way.
- After obtaining the fitted values, plot the smoothed image and the original \(48 \times 48\) image side by side.
- You do not need to find the best \(\rho\) or penalty parameter \(\lambda\). Just pick reasonable values, see how they affect the results, and report your findings.

Solution:

Analytical Equations: (Only the code and the implementation are needed for this question; the equations below are for clarification.)

We observe pixel intensities on a grid (here \(48\times 48\)) \[ x_i=(u_i,v_i)\in[0,1]^2,\qquad i=1,\dots,n,\quad n=48^2=2304, \] and let \(y_i\in\mathbb{R}\) be the observed grayscale at that pixel (rescaled to \([0,1]\) or \([0,255]\)). We model each pixel value as \[ y_i \;=\; f(x_i)\;+\;\varepsilon_i,\qquad \mathbb{E}[\varepsilon_i]=0, \] where \(f\) is the unknown smooth image intensity function.

Pick a positive-definite kernel \(k:\mathbb{R}^2\times\mathbb{R}^2\to\mathbb{R}\) with associated RKHS \(\mathcal{H}_k\) and norm \(\|\cdot\|_{\mathcal{H}_k}\) (here Matern kernel). We estimate \(f\) via Tikhonov-regularized least squares:

\[ \widehat{f} \;=\; \arg\min_{f\in\mathcal{H}_k} \Bigg\{ \frac{1}{n}\sum_{i=1}^n \big(y_i-f(x_i)\big)^2 \;+\; \lambda \,\|f\|_{\mathcal{H}_k}^2 \Bigg\}, \quad \lambda>0. \]

By the representer theorem, \(\widehat{f}(x)=\sum_{j=1}^n \alpha_j\, k(x,x_j)\) for some coefficient vector \(\boldsymbol{\alpha}=(\alpha_1,\ldots,\alpha_n)^\top\).
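Substituting this representation into the objective, with \(K_{ij}=k(x_i,x_j)\) and \(\|f\|_{\mathcal H_k}^2=\boldsymbol{\alpha}^\top K\boldsymbol{\alpha}\), reduces the optimization to a finite-dimensional ridge problem:

\[ \widehat{\boldsymbol{\alpha}} \;=\; \arg\min_{\boldsymbol{\alpha}\in\mathbb{R}^n} \Big\{ \tfrac{1}{n}\,\big\|\mathbf{y}-K\boldsymbol{\alpha}\big\|_2^2 \;+\; \lambda\,\boldsymbol{\alpha}^\top K\boldsymbol{\alpha} \Big\}, \]

and setting its gradient with respect to \(\boldsymbol{\alpha}\) to zero yields the closed form below.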

Let \(K\in\mathbb{R}^{n\times n}\) be the Gram matrix \(K_{ij}=k(x_i,x_j)\), \(\mathbf{y}=(y_1,\dots,y_n)^\top\), and \(I\) the \(n\times n\) identity. Then the coefficients and fitted values are

\[ \boldsymbol{\alpha} =\left(K+n\lambda I\right)^{-1}\mathbf{y}, \qquad \widehat{\mathbf{y}} =K\boldsymbol{\alpha} = S_\lambda\,\mathbf{y}, \]

where \(S_\lambda:=K(K+n\lambda I)^{-1}\) is the smoothing matrix.

Now for any target grid \(X^\star=\{x^\star_1,\dots,x^\star_m\}\) (e.g., the same \(48\times 48\) grid for denoising, or a denser \(128\times128\) grid for super-resolution), we form the cross-kernel \(K^\star\in\mathbb{R}^{m\times n}\) with entries \(K^\star_{ij}=k(x^\star_i,x_j)\). Then

\[ \widehat{\mathbf{y}}^\star \;=\; K^\star \boldsymbol{\alpha} \;=\; K^\star\,(K+n\lambda I)^{-1}\,\mathbf{y}. \] This simultaneously smooths noise and interpolates between pixels.
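With smoothness \(\nu=3/2\), the Matern kernel used in the code below has the closed form \[ k(x,x') \;=\; \Big(1+\frac{\sqrt{3}\,d}{\rho}\Big)\exp\!\Big(-\frac{\sqrt{3}\,d}{\rho}\Big), \qquad d=\|x-x'\|_2, \] where \(\rho>0\) is the length-scale parameter.

Computationally, \(\boldsymbol{\alpha}\) is just the solution of the linear system \((K+n\lambda I)\boldsymbol{\alpha}=\mathbf{y}\), so an explicit matrix inverse is never required. Here is a minimal sketch of a Cholesky-based solve, as an alternative to the ginv() call used below (krr_solve is a hypothetical helper; it assumes the Gram matrix K and response y have already been formed, and the small jitter keeps chol() numerically stable):

  # Solve (K + ridge*I) alpha = y via Cholesky, without forming an explicit inverse
  krr_solve <- function(K, y, ridge) {
    R <- chol(K + (ridge + 1e-8) * diag(nrow(K)))  # upper-triangular factor: t(R) %*% R = K + ridge*I (up to jitter)
    backsolve(R, forwardsolve(t(R), y))            # two triangular solves
  }
  # e.g., alpha <- krr_solve(K, img_vector, lambda)  # same system that ginv() solves below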

#==============================================#
# Smoothing (predict on the same 48*48 grid)   #
#==============================================#
  library(MASS)  # for ginv function

  # Matern kernel function with nu = 1.5
  matern_kernel <- function(d, rho) {
    return((1 + sqrt(3) * d / rho) * exp(-sqrt(3) * d / rho))}

  # vectorize the image
  img_vector    <- as.vector(img48)                  # convert to vector
  n             <- length(img_vector)                # number of pixels
  coords        <- expand.grid(x = 1:48, y = 1:48)  # training grid (column-major order matches as.vector)
  dist_matrix   <- as.matrix(dist(coords))           # pairwise distance matrix
  rho           <- 5                                 # length scale parameter
  K             <- matern_kernel(dist_matrix, rho)   # kernel matrix

  lambda        <- 0.1                               # regularization parameter (the factor n from the equations is absorbed into lambda here)
  alpha         <- ginv(K + lambda * diag(n)) %*% img_vector  # KRR coefficients
  fitted_values <- K %*% alpha                       # fitted pixel values
  
  # clip fitted values to [0, 1]
  fitted_values <- pmin(pmax(fitted_values, 0), 1)
  # reshape fitted values back to image
  smoothed_img  <- matrix(fitted_values, nrow = 48, ncol = 48, byrow = FALSE)
  # plot the smoothed image
  par(mfrow=c(1,2), mar=c(0,0,1,0))
  image(img48, col = gray(256:0/256), zlim=c(0,1),xlab="", ylab="", axes=FALSE, main="Original Image")
  image(smoothed_img, col = gray(256:0/256), zlim=c(0,1), xlab="", ylab="", axes=FALSE, main="Smoothed Image")

#==============================================#
#  Super-resolution (predict on a denser grid) #
#==============================================#
# Target grid, e.g., 128x128 or more (note: higher resolution requires more memory).
# The prediction grid must cover the same 1..48 range as the training grid.
coords_star <- expand.grid(u = seq(1, 48, length.out = 128),
                           v = seq(1, 48, length.out = 128))

# Cross-kernel K* between target and training grids
pairwise_dist <- function(A, B) {
  # A: m*d, B: n*d; returns the m x n matrix of squared Euclidean distances
  AA <- rowSums(A^2)
  BB <- rowSums(B^2)
  outer(AA, BB, "+") - 2 * (A %*% t(B))
}
D_star   <- sqrt(pmax(pairwise_dist(as.matrix(coords_star), as.matrix(coords)), 0))
K_star   <- matern_kernel(D_star, rho)
y_star   <- pmin(pmax(K_star %*% alpha, 0), 1)
img128   <- matrix(y_star, 128, 128, byrow = FALSE)
image(img128, col = gray(256:0/256), zlim=c(0,1),xlab="", ylab="", axes=FALSE, main="128*128 RKHS upsample")
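In general, increasing the length scale \(\rho\) or the penalty \(\lambda\) produces a smoother but more blurred reconstruction, while smaller values keep the fitted image close to the original pixelated values; \(\rho=5\) and \(\lambda=0.1\) above are simply one reasonable choice rather than a tuned optimum.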

Question 2: Positive-Definiteness of Kernels [40 pts]

a) Let \(\cal X\) be a set and \(\cal F\) be a Hilbert space. \(\Phi\) is a map from \(\cal X\) to \(\cal F\). If we define a kernel function \(k(\cdot, \cdot)\) as \[k(x, x') = \langle \Phi(x), \Phi(x') \rangle_{\cal F}, \quad \forall x, x' \in \cal X\] show that \(k\) is a positive definite kernel (Hint: use definition).

b) Suppose \(k_1\) and \(k_2\) are two positive definite kernels on \(\cal X\). Show that the kernel \(k = k_1 + k_2\) is also a positive definite kernel.

c) Consider the kernel function \[k(x, x') = 1\{ | x - x' | < \sigma \}.\] Is this a positive definite kernel? If yes, prove it. If no, give a counterexample.

Solution:

a) Symmetry is immediate: \(k(x,x')=k(x',x)\). For any \(n\in\mathbb N\), points \(x_1,\dots,x_n\in\mathcal X\), and scalars \(c_1,\dots,c_n\),

\[ \sum_{i,j=1}^n c_i c_j k(x_i,x_j) =\sum_{i,j=1}^n c_i c_j \langle \Phi(x_i),\Phi(x_j)\rangle_{\mathcal F} =\Big\langle \sum_i c_i \Phi(x_i), \sum_j c_j \Phi(x_j)\Big\rangle_{\mathcal F} =\Big\|\sum_i c_i \Phi(x_i)\Big\|_{\mathcal F}^2 \;\ge 0. \]

Thus the quadratic form is always nonnegative, so every Gram matrix of \(k\) is positive semidefinite, i.e., \(k\) is a PD kernel.

b)

Let \(k_1,k_2\) be PD on \(\mathcal X\). Define \(k=k_1+k_2\). For any \(x_1,\dots,x_n\) and \(c_1,\dots,c_n\),

\[ \sum_{i,j} c_i c_j k(x_i,x_j) =\sum_{i,j} c_i c_j k_1(x_i,x_j)+\sum_{i,j} c_i c_j k_2(x_i,x_j)\;\ge\;0+0=0, \]

since each term is \(\ge 0\) by PD of \(k_1\) and \(k_2\). Hence \(k\) is PD.

(Equivalently, Gram\((k)=\) Gram\((k_1)+\) Gram\((k_2)\), and a sum of PSD matrices is PSD.)

c) A counterexample shows the Gram matrix can have a negative eigenvalue.

Take \(x_1=0,\; x_2=1,\; x_3=2\) and \(\sigma=1.5\). Then

\[ |x_1-x_2|=1<1.5,\quad |x_2-x_3|=1<1.5,\quad |x_1-x_3|=2>1.5, \]

so the Gram matrix is

\[ K=\begin{pmatrix} 1&1&0\\ 1&1&1\\ 0&1&1 \end{pmatrix}. \]

Its eigenvalues are \(1+\sqrt{2}\approx 2.414\), \(1\), and \(1-\sqrt{2}\approx -0.414\), so \(K\) is not PSD. Therefore \(k\) is not a PD kernel in general.
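A quick numerical check of this counterexample in R (a minimal sketch; the \(3\times 3\) Gram matrix is entered directly):

  # verify the counterexample: the Gram matrix has a negative eigenvalue
  K3 <- matrix(c(1, 1, 0,
                 1, 1, 1,
                 0, 1, 1), nrow = 3, byrow = TRUE)
  eigen(K3, symmetric = TRUE)$values   # 1 + sqrt(2), 1, 1 - sqrt(2): one eigenvalue is negative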

Remark (when it is PD): If \(\sigma\) is so small that no two distinct points are within \(\sigma\) (i.e., off-diagonals are all zero), then \(K=I\) for that sample and is PSD. But as a kernel on \(\mathbb R\) (i.e., for all finite samples), the indicator \(1\{|x-x'|<\sigma\}\) is not positive definite.

Question 3: Uniqueness of Kernel Functions [10 pts]

Show that if a reproducing kernel \(k(\cdot, \cdot)\) exists for a Hilbert space \(\cal H \subset \mathbb{R}^{\cal X}\), then it is unique. Hint: you may consider assuming that there are two different kernel functions \(k_1\) and \(k_2\) for \(\cal H\), and then show that the Hilbert norm of their difference \(\|k_1(x, \cdot) - k_2(x, \cdot)\|_{\cal H}\) is zero for any \(x \in \cal X\). To do that, expand this norm and use properties of a Hilbert space and the reproducing property.

Solution:

Let \(\mathcal H\subset \mathbb R^{\mathcal X}\) be a Hilbert space of functions on \(\mathcal X\). Assume \(k_1\) and \(k_2\) are both reproducing kernels for \(\mathcal H\). Write \(k_{r,x} := k_r(x,\cdot)\in\mathcal H\) for \(r\in\{1,2\}\).

By the reproducing property, for every \(f\in\mathcal H\) and every \(x\in\mathcal X\),

\[ \langle f, k_{1,x}\rangle_{\mathcal H}=f(x)=\langle f, k_{2,x}\rangle_{\mathcal H}. \]

Hence for all \(f\in\mathcal H\),

\[ \langle f,\, k_{1,x}-k_{2,x}\rangle_{\mathcal H}=0. \]

So \(k_{1,x}-k_{2,x}\) is orthogonal to every vector in \(\mathcal H\). In particular, taking \(f=k_{1,x}-k_{2,x}\) itself,

\[ \|k_{1,x}-k_{2,x}\|_{\mathcal H}^2 =\langle k_{1,x}-k_{2,x},\,k_{1,x}-k_{2,x}\rangle_{\mathcal H} =0. \]

Therefore \(k_{1,x}=k_{2,x}\) for every \(x\in\mathcal X\). Taking inner products with \(k_{1,y}=k_{2,y}\) gives, for all \(x,y\),

\[ k_1(x,y)=\langle k_{1,x},k_{1,y}\rangle_{\mathcal H} =\langle k_{2,x},k_{2,y}\rangle_{\mathcal H}=k_2(x,y). \]

Thus the reproducing kernel for \(\mathcal H\) is unique.

(Equivalent viewpoint: for each fixed \(x\), the evaluation functional \(L_x:f\mapsto f(x)\) is continuous; by the Riesz representation theorem it has a unique representer \(k_x\in\mathcal H\). Hence uniqueness of the whole kernel.)

Question 4: Kernel Logistic Regression [25 pts]

By the same approach as kernel ridge regression, we can also use the kernel method to do logistic regression. For a logistic regression with linear predictor, the -2 log-likelihood function is given by

\[ -2 \ell(\beta) = -2 \sum_{i=1}^n \left[ y_i x_i^T \beta - \log(1 + \exp(x_i^T \beta)) \right], \] where the linear predictor \(x_i^T \beta\) models the effect. We know this can be replaced by \(f(x_i)\) for some function \(f \in \cal H\); all we need to do is specify a kernel function and add a penalty term \(\lambda \|f\|_{\cal H}^2\). So the optimization problem becomes

\[ \widehat\alpha = \arg\min_{\alpha \in \mathbb{R}^n} \left\{ -2 \sum_{i=1}^n \left[ y_i K_i^T \alpha - \log(1 + \exp(K_i^T \alpha)) \right] + \lambda \alpha^T K \alpha \right\}, \]

where \(K_i\) is the \(i\)-th row of the kernel matrix \(K\). An analytic solution does not exist, but you can define this loss function and throw it into the optim() function in R to get the solution. For this question, use the following training and testing data from the handwritten digit dataset, with digits 5 and 6 only. You should:

- Define your own kernel function \(K(\cdot, \cdot)\) on two images. You may consider a radial basis function (RBF) kernel, and can pick any reasonable value for \(\sigma\). Again, you do not need to optimize \(\sigma\). Just pick a reasonable value as long as your result is reasonable.
- Write your own loss function as shown above, and use optim() to get the solution of the coefficients.
- Use the fitted model to predict the testing data, and report the testing accuracy. Remember, this is a classification problem, so you should report a contingency table.
- You do not need to optimize \(\lambda\), but pick a reasonable value and see how it affects the results.

#================#
# LOAD THE DATA  #
#================#
# get the first 100 observations with digits 5 and 6
digits5 <- zip.train[zip.train[,1] == 5, ]
digits5 <- digits5[1:50, ]
digits6 <- zip.train[zip.train[,1] == 6, ]
digits6 <- digits6[1:50, ]

train56 <- rbind(digits5, digits6)

# get the testing data 
digits5_test <- zip.test[zip.test[,1] == 5, ]
digits5_test <- digits5_test[1:50, ]
digits6_test <- zip.test[zip.test[,1] == 6, ]
digits6_test <- digits6_test[1:50, ]

test56 <- rbind(digits5_test, digits6_test)

Solution:

#========================================#
# kernel logistic regression (dual form) #
#========================================#
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
# 1) Prep: y in {0,1}, X as images #
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
get_xy <- function(mat) {
  y <- as.integer(mat[,1] == 6)       # 1 = digit 6, 0 = digit 5
  X <- as.matrix(mat[,-1, drop=FALSE])
  list(X=X, y=y)
}
tr <- get_xy(train56)
te <- get_xy(test56)

n  <- nrow(tr$X)

#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
#   2) RBF kernel + bandwidth      #
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
rbf_kernel <- function(A, B, sigma) {
  # returns |A|x|B| Gram with K_ij = exp(-||A_i - B_j||^2/(2 sigma^2))
  # A: m*d, B: n*d
  AA <- rowSums(A^2)
  BB <- rowSums(B^2)
  D2 <- outer(AA, BB, "+") - 2 * (A %*% t(B))
  exp(- D2 / (2*sigma^2))
}

# "median" for sigma on a small subset 
median_sigma <- function(X, m = min(300, nrow(X))) {
  idx <- sample(seq_len(nrow(X)), m)
  XX  <- X[idx, , drop=FALSE]
  AA <- rowSums(XX^2); D2 <- outer(AA, AA, "+") - 2*(XX %*% t(XX))
  d  <- sqrt(pmax(D2[upper.tri(D2)], 0))
  median(d)
}

set.seed(546)
sigma  <- median_sigma(tr$X)          
lambda <- 0.01                        #  larger = stronger reg

K      <- rbf_kernel(tr$X, tr$X, sigma)
K      <- (K + t(K))/2                # symmetrize
K      <- K + 1e-8*diag(n)            # tiny jitter for stability

#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
# 3) Loss and gradient (dual coefficients alpha)
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
# Logistic pieces
softplus <- function(z) ifelse(z > 35, z, log1p(exp(z)))  # stable log(1+e^z)
sigmoid  <- function(z) 1/(1+exp(-z))

# Objective
klogit_obj <- function(alpha, y, K, lambda) {
  f    <- as.numeric(K %*% alpha)
  nll2 <- -2 * sum(y * f - softplus(f))
  pen  <- lambda * as.numeric(t(alpha) %*% K %*% alpha)
  nll2 + pen
}

# Gradient
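# Writing f = K %*% alpha and p = sigmoid(f), and using d/dz softplus(z) = sigmoid(z),
# the gradient of the objective above with respect to alpha is
#   -2 * t(K) %*% (y - p) + 2 * lambda * K %*% alpha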
klogit_grad <- function(alpha, y, K, lambda) {
  f <- as.numeric(K %*% alpha)
  p <- sigmoid(f)
  -2 * (t(K) %*% (y - p)) + 2 * lambda * (K %*% alpha)
}

#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
# 4) Optimize with BFGS
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
alpha0 <- rep(0, n)
fit <- optim(
  par     = alpha0,
  fn      = klogit_obj,
  gr      = klogit_grad,
  method  = "BFGS",
  control = list(maxit = 200, reltol = 1e-8),
  y       = tr$y, K = K,
  lambda  = lambda
)
alpha_hat <- fit$par

#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
# 5) Train-set sanity check
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
f_tr   <- as.numeric(K %*% alpha_hat)
p_tr   <- sigmoid(f_tr)
yh_tr  <- as.integer(p_tr > 0.5)
cat("Train accuracy:", mean(yh_tr == tr$y), "\n")
## Train accuracy: 1
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
# 6) Test prediction & accuracy
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
K_star <- rbf_kernel(te$X, tr$X, sigma)        # test-vs-train
f_te   <- as.numeric(K_star %*% alpha_hat)
p_te   <- sigmoid(f_te)
yh_te  <- as.integer(p_te > 0.5)

# Confusion matrix & accuracy
tbl <- table(Predicted = yh_te, Actual = te$y)
acc <- mean(yh_te == te$y)
print(tbl)
##          Actual
## Predicted  0  1
##         0 49  2
##         1  1 48
cat(sprintf("Test accuracy: %.3f\n", acc))
## Test accuracy: 0.970
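A note on \(\lambda\): very small values let the kernelized model fit the training labels exactly (train accuracy of 1 above), while very large values shrink \(\boldsymbol{\alpha}\) toward zero and push every predicted probability toward \(0.5\), so only a moderately sized value is needed here.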