Instructions

Students are encouraged to work together on homework. However, sharing, copying, or providing any part of a homework solution or code is an infraction of the University’s rules on Academic Integrity. Any violation will be punished as severely as possible. Final submissions must be uploaded to Gradescope. No email or hard copy will be accepted. For late submission policy and grading rubrics, please refer to the course website.

Question 1: Kernel Functions [10 pts]

In our lecture, we showed that the kernel function

\[ k(x, y) = (x^T y)^2 \]

achieves the feature mapping that contains all second order interaction terms. Use this to show that the kernel function

\[ k(x, y) = (x^T y + 1)^2 \]

achieves the feature mapping that contains all second order terms, all first order terms, and an intercept term.

Solution:

In general, for \(k(x, y) = (x^T y + c)^2\) with \(c > 0\), we can simply define the augmented vector

\[ \psi(x) = (\sqrt{c}, x) \]

for any \(x \in \mathbb{R}^d\); for the kernel above we take \(c = 1\). With \(\psi(x)=(\sqrt{c},x)\) and \(\psi(y)=(\sqrt{c},y)\), we have \[ (x^\top y + c)^2 \;=\; \big([\sqrt{c};\,x]^\top[\sqrt{c};\,y]\big)^2 \;=\; \big(\psi(x)^\top \psi(y)\big)^2, \]

so the kernel is the homogeneous quadratic kernel \(K(\tilde x,\tilde y)=(\tilde x^\top \tilde y)^2\) in the augmented space. By the result from lecture, it contains all second order interaction terms of the augmented vector, i.e., all second order terms in \(x\), all first order terms, and an intercept term. Expanding directly gives:

\[ (x^\top y + c)^2 = \underbrace{(x^\top y)^2}_{\text{all }x_ix_jy_iy_j} \;+\; 2c\,\underbrace{x^\top y}_{\text{all linear }x_i y_i} \;+\; \underbrace{c^2}_{\text{intercept}}. \]
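
As a quick numerical sanity check (purely illustrative; the constant c_const, the helper psi, and the random vectors below are made up for this example), the identity can be verified in R:

# verify that (x'y + c)^2 equals the homogeneous quadratic kernel on the augmented vectors
set.seed(1)
c_const <- 1
x <- rnorm(3); y <- rnorm(3)
psi <- function(v) c(sqrt(c_const), v)                          # augmented vector (sqrt(c), x)
all.equal((sum(x * y) + c_const)^2, (sum(psi(x) * psi(y)))^2)   # TRUE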

Question 2: Weighted SVM [30 pts]

In a paper on outcome weighted learning, Zhao et al. (2012) proposed to use the following objective function (see Equation 2.2 in the paper) to estimate optimal treatment regimes. This is a weighted support vector machine classification problem.

\[ \min_{f \in \mathcal{H}_K} \frac{1}{n} \sum_{i=1}^n w_i \big(1 - A_i f(x_i)\big)_+ + \lambda ||f||^2_K \]

where \(w_i\) is an observation weight, \(A_i \in \{-1, 1\}\) is a binary outcome, and \(\mathcal{H}_K\) is an RKHS with kernel function \(K\).

a. [10 pts] Note that this corresponds to a penalized version of the SVM. Please provide the primal (Lagrangian) form of the corresponding SVM optimization problem with a linear kernel.

Solution:

The first term in the objective function is the empirical hinge loss, which corresponds to the slack variables \(\xi_i = (1 - A_i f(x_i))_+\), and the second term corresponds to the norm penalty \(\frac{1}{2} \| \boldsymbol \beta \|_2^2\). Therefore, the primal form of the corresponding SVM optimization problem with a linear kernel \(f(x) = \boldsymbol \beta^T x + \beta_0\) is

\[ \begin{aligned} \min_{\boldsymbol \beta, \beta_0, \boldsymbol \xi} & \quad \frac{1}{2} \| \boldsymbol \beta \|_2^2 + C \sum_{i=1}^n w_i \xi_i \\ \text{s.t.} & \quad A_i (\boldsymbol \beta^T x_i + \beta_0) \geq 1 - \xi_i, \quad i = 1, \ldots, n \\ & \quad \xi_i \geq 0, \quad i = 1, \ldots, n \end{aligned} \]

The Lagrangian function for this problem is given by

\[ \begin{aligned} L(\boldsymbol \beta, \beta_0, \boldsymbol \xi, \boldsymbol \alpha, \boldsymbol \gamma) = & \frac{1}{2} \| \boldsymbol \beta \|_2^2 + C \sum_{i=1}^n w_i \xi_i - \sum_{i=1}^n \alpha_i [A_i (\boldsymbol \beta^T x_i + \beta_0) - 1 + \xi_i] - \sum_{i=1}^n \gamma_i \xi_i \end{aligned} \]

where \(\boldsymbol \alpha = (\alpha_1, \ldots, \alpha_n)\) and \(\boldsymbol \gamma = (\gamma_1, \ldots, \gamma_n)\) are the Lagrange multipliers associated with the two sets of constraints. The primal form is

\[ \min_{\boldsymbol \beta, \beta_0, \boldsymbol \xi} \max_{\boldsymbol \alpha \geq 0, \boldsymbol \gamma \geq 0} \quad L(\boldsymbol \beta, \beta_0, \boldsymbol \xi, \boldsymbol \alpha, \boldsymbol \gamma) \]

The optimization of this problem is almost equivalent to the standard SVM, except that the slack variables \(\xi_i\) are weighted by \(w_i\).

b. [20 pts] Please provide the dual form of the corresponding problem, and derive the final optimization problem based on the dual.

Solution:

To derive the dual form, we first take the partial derivatives of the Lagrangian with respect to the primal variables \(\boldsymbol \beta\), \(\beta_0\), and \(\boldsymbol \xi\), and set them to zero.

\[ \begin{aligned} \frac{\partial L}{\partial \boldsymbol \beta} &= \boldsymbol \beta - \sum_{i=1}^n \alpha_i A_i x_i = 0 &\implies& \boldsymbol \beta = \sum_{i=1}^n \alpha_i A_i x_i \\ \frac{\partial L}{\partial \beta_0} &= -\sum_{i=1}^n \alpha_i A_i = 0 &\implies& \sum_{i=1}^n \alpha_i A_i = 0 \\ \frac{\partial L}{\partial \xi_i} &= C w_i - \alpha_i - \gamma_i = 0 &\implies& \alpha_i + \gamma_i = C w_i \\ \end{aligned} \]

Substituting these conditions back into the Lagrangian, we obtain the dual form:

\[ \begin{aligned} \max_{\boldsymbol \alpha} \quad & L_D(\boldsymbol \alpha) = \sum_{i=1}^n \alpha_i - \frac{1}{2} \sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j A_i A_j x_i^T x_j \\ \text{s.t.} & \quad 0 \leq \alpha_i \leq C w_i, \quad i = 1, \ldots, n \\ & \quad \sum_{i=1}^n \alpha_i A_i = 0 \end{aligned} \]

This is the final optimization problem: it is identical to the standard SVM dual except that the upper box bound on each \(\alpha_i\) is \(C w_i\) rather than \(C\).
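
For completeness, the cancellation behind this substitution: plugging \(\boldsymbol\beta = \sum_{i} \alpha_i A_i x_i\) into \(L\) and using \(\sum_i \alpha_i A_i = 0\) and \(\alpha_i + \gamma_i = C w_i\) gives

\[ L = \underbrace{\sum_{i=1}^n (C w_i - \alpha_i - \gamma_i)\,\xi_i}_{=\,0} - \underbrace{\beta_0 \sum_{i=1}^n \alpha_i A_i}_{=\,0} + \sum_{i=1}^n \alpha_i + \frac{1}{2}\|\boldsymbol\beta\|_2^2 - \boldsymbol\beta^\top \sum_{i=1}^n \alpha_i A_i x_i = \sum_{i=1}^n \alpha_i - \frac{1}{2}\sum_{i=1}^n\sum_{j=1}^n \alpha_i \alpha_j A_i A_j x_i^T x_j. \]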

c. [30 pts] Take the mixture.example dataset in the ElemStatLearn package, and use the following code to define the weights:

library(ElemStatLearn)
data(mixture.example)
# the grid on x1 and x2
px1 = mixture.example$px1
px2 = mixture.example$px2
# plot the data and true decision boundary
prob <- mixture.example$prob
prob.bayes <- matrix(prob, 
                     length(px1), 
                     length(px2))
contour(px1, px2, prob.bayes, levels=0.5, lty=2, 
        labels="", xlab="x1",ylab="x2",
        main="True Boundary", col = "red", lwd = 2)
points(mixture.example$x, col=ifelse(mixture.example$y==1, "darkorange", "deepskyblue"), pch = 19)

# define the new weights
w = rep(1, nrow(mixture.example$x))
w[mixture.example$x[,1] < 0.8 & mixture.example$x[,2] > 1 & mixture.example$y == 0] = 10  

# plot the data and visualized the ones with large weights
plot(mixture.example$x, col = ifelse(mixture.example$y == 1, "darkorange", "deepskyblue"), 
     cex = 0.5, pch = 19)
points(mixture.example$x[w == 10, ], col = "darkgreen", pch = 19)

Explore the quadprog package in R to solve the dual problem you derived in part (b). For this question, change the inner product \(x_i^T x_j\) to a polynomial kernel with degree 4, and make sure to add an intercept term before applying the kernel function. The solve.QP() function in this package is specified as solve.QP(Dmat, dvec, Amat, bvec, meq = 0); it solves \(\min_b \, \frac{1}{2} b^T D b - d^T b\) subject to \(A^T b \geq b_0\), where Dmat and dvec are \(D\) and \(d\), Amat and bvec are \(A\) and \(b_0\), and the first meq constraints are treated as equalities.

Solution:

# ------------------------------------------------------------
# Weighted kernel SVM via dual (quadprog) on ElemStatLearn::mixture.example
# Kernel: polynomial degree 4 with intercept term: K(x,z) = (1 + x^T z)^4
# Dual:
#   minimize (1/2) alpha^T D alpha - 1^T alpha
#   subject to:
#     y^T alpha = 0
#     0 <= alpha_i <= C * w_i
# where D = (y y^T) ⊙ K
# ------------------------------------------------------------
library(ElemStatLearn)
library(quadprog)

# --- Data and Bayes boundary (for reference plot) ---
data(mixture.example)
X   <- mixture.example$x              # n x 2
y01 <- mixture.example$y             # 0/1 labels
y   <- ifelse(y01 == 1,  1, -1)       # convert to {-1, +1}
n   <- nrow(X)

px1  <- mixture.example$px1
px2  <- mixture.example$px2
prob <- mixture.example$prob
prob.bayes <- matrix(prob, length(px1), length(px2))

# --- Define instance weights (your rule) ---
w <- rep(1, n)
w[X[,1] < 0.8 & X[,2] > 1 & y01 == 0] <- 10

# --- Hyperparameters ---
Cval  <- 1.0               # C parameter (feel free to adjust)
eps   <- 1e-8              # tolerance for support vectors
ridge <- 1e-5             # tiny diagonal add for numerical PD

# --- Degree-4 polynomial kernel with intercept term ---
# Equivalent to augmenting x with a leading 1 before the dot-product.
kpoly_4 <- function(A, B) {
  (1 + A %*% t(B))^4
}

# --- Build kernel matrix and D = (y y^T) ⊙ K ---
K <- kpoly_4(X, X)
Youter <- outer(y, y)     # y y^T
D <- K * Youter
D <- D + diag(ridge, n)   # ensure positive-definiteness for QP

# --- QP components for quadprog::solve.QP
# We solve: min (1/2) alpha^T D alpha - d^T alpha
# with constraints A^T alpha >= b; first meq constraints are equalities.
dvec <- rep(1, n)

# Constraints:
# 1) Equality: y^T alpha = 0  -> Aeq = y, beq = 0
Aeq <- matrix(y, nrow = n, ncol = 1)     # n x 1
beq <- 0

# 2) Inequalities for box constraints 0 <= alpha_i <= C * w_i
#    alpha >= 0       -> I^T alpha >= 0  (quadprog uses A^T alpha >= b)
#    -alpha >= -C w   -> (-I)^T alpha >= -C w  i.e., alpha <= C w
A_ge0 <- diag(n)                          # n x n
b_ge0 <- rep(0, n)

A_leCw <- -diag(n)                        # n x n
b_leCw <- -Cval * w

# Combine: Amat is n x (1 + 2n); bvec length 1 + 2n
Amat <- cbind(Aeq, A_ge0, A_leCw)         # (n x (1+2n))
bvec <- c(beq, b_ge0, b_leCw)

# Number of equalities:
meq <- 1

# --- Solve QP ---
qp_fit <- solve.QP(Dmat = D, dvec = dvec, Amat = Amat, bvec = bvec, meq = meq)
alpha <- qp_fit$solution

# --- Identify support vectors and compute intercept b0
at_lower     <- alpha < eps
at_upper     <- (Cval * w - alpha) < eps
is_SV_margin <- (!at_lower) & (!at_upper)  # 0 < alpha_i < C w_i

# Decision function without intercept: f0(z) = sum_i alpha_i y_i K(x_i, z)
f0 <- function(Z) {
  Kxz <- kpoly_4(X, Z)             # n x m
  as.numeric(t(alpha * y) %*% Kxz) # 1 x m -> numeric vector
}

# Intercept via KKT on margin SVs; if none, use all alpha_i > eps as fallback
idx <- which(is_SV_margin)
if (length(idx) == 0L) idx <- which(alpha > eps)

# b0 = median over chosen SVs of [ y_i - sum_j alpha_j y_j K(x_j, x_i) ]
b_candidates <- y[idx] - f0(X[idx, , drop = FALSE])
b0 <- median(b_candidates)

# Final decision function
f <- function(Z) f0(Z) + b0
pred <- function(Z) ifelse(f(Z) >= 0, 1, -1)

# --- Visualization: decision boundary on provided grid (level = 0) ---
# The mixture.example grid order matches matrix(prob, length(px1), length(px2)).
# We'll compute f over the same grid ordering and reshape.
grid <- expand.grid(px1, px2)  # rows vary px1 fastest by default
f_grid <- f(as.matrix(grid))
f_mat <- matrix(f_grid, nrow = length(px1), ncol = length(px2))

# --- Plots: Bayes boundary (dashed) and learned boundary (solid) ---
par(mfrow = c(1, 1))
contour(px1, px2, prob.bayes, levels = 0.5, lty = 2,
        labels = "", xlab = "x1", ylab = "x2",
        main = "Weighted SVM (poly-4 kernel) vs Bayes boundary",
        col = "red", lwd = 2)
points(X, col = ifelse(y01 == 1, "darkorange", "deepskyblue"), pch = 19)
# our learned boundary at f=0
contour(px1, px2, f_mat, levels = 0, add = TRUE, lwd = 2)

# Optionally highlight heavy-weighted points
points(X[w == max(w), , drop = FALSE], pch = 1, cex = 1.2, lwd = 2)
legend("bottomleft",
       legend = c("Bayes (0.5)", "SVM boundary", "class 1", "class 0", "high weight"),
       col    = c("red", "black", "darkorange", "deepskyblue", "black"),
       lty    = c(2, 1, NA, NA, NA),
       pch    = c(NA, NA, 19, 19, 1),
       lwd    = c(2, 2, NA, NA, 2),
       bty = "n")

d. [20 pts] Instead, you can use our understanding of the representer theorem to formulate the primal form. Note that it is simply the original statement of the optimization problem at the beginning of this question, which is an unconstrained optimization problem. Please provide the form of your finite-dimensional optimization problem. Still use the polynomial kernel with degree 4. You can use any optimization function in R (e.g., optim()) to directly solve the primal problem. In this case, what is your decision function? Visualize the decision boundary again. Is it the same as the one you obtained from the dual form? Please provide a brief discussion of your result.

Solution: By the representer theorem, the RKHS minimizer can be written as

\[ f(x)=\sum_{j=1}^n \beta_j K(x,x_j)+b_0, \]

so with our degree-4 polynomial kernel \(K(x,z)=(1+x^\top z)^4\), the finite-dimensional problem in \((\beta,b_0)\) becomes \[ \min_{\beta\in\mathbb{R}^n,\;b_0\in\mathbb{R}}\; \frac{1}{n}\sum_{i=1}^n w_i\bigl(1-y_i[(K\beta)_i+b_0]\bigr)_+ +\lambda\,\beta^\top K\beta \] where \(K\in\mathbb{R}^{n\times n}\) is the Gram matrix with \(K_{ij}=K(x_i,x_j)\) and \((K\beta)_i=\sum_j K_{ij}\beta_j\). To match the usual soft-margin parameter \(C\) used in the dual, we set

\[ \lambda=\frac{1}{2nC} \]
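
One way to see this correspondence: dividing the constrained primal objective from part (a) by \(nC\) does not change its minimizer, and the margin-norm penalty \(\frac{1}{2}\|\boldsymbol\beta\|_2^2\) there plays the role of \(\frac{1}{2}\|f\|_K^2\) in the kernelized problem, so

\[ \frac{1}{nC}\left(\frac{1}{2} \| \boldsymbol \beta \|_2^2 + C \sum_{i=1}^n w_i \xi_i\right) = \frac{1}{n}\sum_{i=1}^n w_i \xi_i + \frac{1}{2nC}\| \boldsymbol \beta \|_2^2, \]

and matching the penalty coefficients gives \(\lambda = 1/(2nC)\).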

In the code, \(K\) is built via kpoly4(X, X), i.e., \((1+X X^\top)^4\) applied elementwise; the weights \(w\) are set to 10 for the selected class-0 points and 1 otherwise; and we minimize the weighted average of a smoothed hinge loss (for differentiability) plus the RKHS penalty with L-BFGS-B. For numerical stability (mirroring the dual's tiny ridge on \(D\)), we use \(K_r=K+\delta I\) with \(\delta=10^{-5}\) in the penalty only:

\[ \text{obj}(\beta,b_0)=\frac{1}{n}\sum_i w_i\bigl(1-y_i[(K\beta)_i+b_0]\bigr)_+ + \lambda\,\beta^\top K_r\beta \]

The fitted decision function used for prediction and plotting is

\[ \hat f(x)=\sum_{j=1}^n \hat\beta_j K(x,x_j)+\hat b_0, \]

implemented as f_primal(Z) = kpoly4(Z, X) %*% beta_hat + b0_hat. We then evaluate \(\hat f\) on the plotting grid \((px1,px2)\) and draw the decision boundary with contour(..., levels = 0). In the code, the matrix \(Fmat\) is reshaped so that its dimensions match contour(px1, px2, Fmat).

This primal solution matches the dual (up to small numerical differences) when:

  1. \(C\) and \(\lambda\) are linked by \(\lambda=1/(2nC)\),

  2. the same kernel and weights are used, and

  3. the grid reshape is consistent with \((px1,px2)\).

# -------------------------------------------------------------------
# Weighted SVM via the representer theorem (primal form, poly-4 kernel)
# -------------------------------------------------------------------
X   <- mixture.example$x            # n × 2 input matrix
y01 <- mixture.example$y            # {0,1} labels
y   <- ifelse(y01 == 1, +1, -1)     # convert to {+1,−1}
n   <- nrow(X)

# define weights: upweight class-0 points in the selected region
w   <- rep(1, n)
w[X[,1] < 0.8 & X[,2] > 1 & y01 == 0] <- 10

# Build the degree-4 polynomial kernel with intercept
kpoly4 <- function(A, B) {
  (1 + A %*% t(B))^4
}
K      <- kpoly4(X, X)   # n×n Gram matrix

# regularization parameter λ
n      <- nrow(X)
Cval   <- 1.0     # whatever you used in the dual 
lambda <- 1/(2 * n * Cval)
# Define a smooth approximation to the hinge loss and its derivative
smooth_hinge <- function(m, tau = 1e-1) {
  # m = y * f; tau = smoothing parameter; u = 1 - m
  u    <- 1 - m
  loss <- ifelse(u <= 0, 0, ifelse(u < tau, u^2 / (2 * tau), u - tau / 2))
  dLdu <- ifelse(u <= 0, 0, ifelse(u < tau, u / tau, 1))   # derivative of loss w.r.t. u
  list(loss = loss, grad = dLdu)
}

# Objective + gradient for optim()
#    θ = c(beta (length n), b₀ (scalar))
obj_and_grad <- function(theta) {
  beta  <- theta[1:n]
  b0    <- theta[n+1]
  # compute f_i = (K %*% beta)_i + b0
  f     <- as.numeric(K %*% beta) + b0
  # compute smooth hinge on margins m_i = y_i * f_i
  sh    <- smooth_hinge(y * f)
  # ridge-stabilized Gram matrix for the penalty (mirrors the dual's tiny ridge)
  K_r   <- K + diag(1e-5, n)
  # objective: weighted avg. loss + RKHS penalty
  obj  <- (1/n) * sum(w * sh$loss) + lambda * as.numeric(t(beta) %*% K_r %*% beta)
  # gradient w.r.t. f: dL/df = (w/n) * dL/du * du/df = -(w/n) * y * dL/du, since u = 1 - y*f
  dL_df <- - (y * sh$grad) * w / n
  
  # chain rule → grads
  grad_b0 <- sum(dL_df)               # ∂/∂b0
  grad_beta <- 2 * lambda * (K_r %*% beta) + (K %*% dL_df)
  grad <- c(as.numeric(grad_beta), grad_b0)
  attr(obj, "gradient") <- grad
  obj
}

# Run L-BFGS optimization, warm-started from the dual solution of part (c)
# (if alpha / b0 are not in the workspace, rep(0, n + 1) works as a cold start).
# Note: optim() ignores a "gradient" attribute on fn's return value, so pass gr explicitly.
init_theta <- c(alpha * y, b0)
fit        <- optim(init_theta,
             fn      = function(theta) as.numeric(obj_and_grad(theta)),
             gr      = function(theta) attr(obj_and_grad(theta), "gradient"),
             method  = "L-BFGS-B",
             control = list(maxit = 1000, factr = 1e7, pgtol = 1e-5),
             hessian = FALSE)

beta_hat   <- fit$par[1:n]
b0_hat     <- fit$par[n+1]

# 7. Define decision function
f_primal <- function(Z) {
  # Z: m×2 matrix of new points
  as.numeric(kpoly4(Z, X) %*% beta_hat + b0_hat)
}

# Plot the decision boundary vs. true Bayes boundary
px1        <- mixture.example$px1
px2        <- mixture.example$px2
prob.bayes <- matrix(mixture.example$prob,length(px1), length(px2))

# compute f on grid
grid  <- expand.grid(px1, px2)
fvals <- f_primal(as.matrix(grid))
Fmat  <- matrix(fvals,  nrow=length(px1),ncol=length(px2),byrow=FALSE)

# plot
contour(px1, px2, prob.bayes, levels = 0.5, lty = 2,
        col = "red", lwd = 2,
        xlab = "x1", ylab = "x2",
        main = "Primal Weighted SVM (poly-4 kernel)")
points(X, col = ifelse(y01==1,"darkorange","deepskyblue"), pch = 19)
contour(px1, px2, Fmat, levels = 0, add = TRUE, lwd = 2)
legend("bottomleft",
       legend = c("Bayes (0.5)", "Primal SVM boundary", "class 1","class 0","high weight"),
       col    = c("red","black","darkorange","deepskyblue","deepskyblue"),
       lty    = c(2,1,NA,NA,NA),
       pch    = c(NA,NA,19,19,1),
       lwd    = c(2,2,NA,NA,2),
       bty    = "n")

Question 3: Kernel PCA [15 pts]

The following code performs PCA on the handwritten digits dataset in the ElemStatLearn package.

  library(ElemStatLearn)
  data(zip.train)
  
  # PCA
  pca <- prcomp(zip.train[, -1], center=TRUE, scale.=FALSE)
  
  # plot the first two PCs
  plot(pca$x[,1], pca$x[,2], col=zip.train[,1]+1, pch=19, cex = 0.5,
       xlab="First Principal Component", ylab="Second Principal Component")

The goal of this question is to implement kernel PCA by yourself. You can use any matrix decomposition functions in R (e.g., eigen(), svd(), etc.), but you should implement most of the main steps yourself. Consider the following:

  1. Use a Gaussian kernel with bandwidth \(\sigma\). You have to choose your own bandwidth but it does NOT need to be optimized.
  2. Randomly select 50 images from each digit (0-9) to form a dataset with 500 images.
  3. Write your own code to perform kernel PCA on the dataset, and visualize the first two kernel principal components.
  4. Is the result better than the linear PCA? Please provide a brief explanation.
  5. If you are asked to choose a better kernel (possibly define your own kernel function) to improve the result, what would you do? Is your kernel function positive definite? Please provide a brief explanation, and also plot the first two kernel principal components based on your new kernel. Your result does not need to be better than the Gaussian kernel, but you should provide a reasonable explanation of why you think your kernel is good.

Solution:

  library(ElemStatLearn)
  set.seed(123)
  data(zip.train)
  # randomly select 50 images from each digit
  selected_indices <- unlist(lapply(0:9, function(d) {
    sample(which(zip.train[, 1] == d), 50)
  }))
  
  data_subset <- zip.train[selected_indices, -1]
  n <- nrow(data_subset)

  # Gaussian kernel function
  gaussian_kernel <- function(x, y, sigma) {
    exp(-sum((x - y)^2) / (2 * sigma^2))
  }

  # compute a median distance as the bandwidth
  dist_matrix <- as.matrix(dist(data_subset))
  median_dist <- median(dist_matrix[upper.tri(dist_matrix)])
  sigma <- median_dist
  
  # Compute the kernel matrix  
  K <- matrix(0, n, n)
  for (i in 1:n) {
    for (j in 1:n) {
      K[i, j] <- gaussian_kernel(data_subset[i, ], data_subset[j, ], sigma)
    }
  }
  
  # Center the kernel matrix
  one_n <- matrix(1/n, n, n)
  K_centered <- K - one_n %*% K - K %*% one_n + one_n %*% K %*% one_n
  # Eigen decomposition
  eig <- eigen(K_centered)
  # Get the top two eigenvectors
  pc1 <- eig$vectors[, 1]
  pc2 <- eig$vectors[, 2]
  # Plot the first two kernel principal components
  plot(pc1, pc2, col=zip.train[selected_indices, 1] + 1, pch=19, cex=0.5,
       xlab="First Kernel Principal Component", ylab="Second Kernel Principal Component",
       main="Kernel PCA with Gaussian Kernel")
  
  legend("topright", legend=0:9, col=2:11, pch=19, bty="n")

This result is not really better than the linear PCA. One possible reason is that the Gaussian kernel is affected by the dimensionality of the data: the images are 256-dimensional, and with a single global bandwidth the kernel may not effectively capture the structure in such a high-dimensional space. I am going to cheat a bit by selecting pixels adaptively. For each pair of digits, I pick the pixel whose between-group difference has the largest variance, and then use the union of the selected pixels.

  # for each pair of digits, find the pixel whose between-group difference has the largest variance
  pixel_select = matrix(NA, nrow=10, ncol=10)

  for (i in 1:10) {
    for (j in 1:10) {
      if (i < j)
        pixel_select[i, j] = which.max(apply(data_subset[zip.train[selected_indices, 1] == (i-1), ] - 
                                              data_subset[zip.train[selected_indices, 1] == (j-1), ], 2, var))
    }
  }

  # get all pixels that are selected
  selected_pixels = unique(as.vector(pixel_select))
  selected_pixels = selected_pixels[!is.na(selected_pixels)]
  
  data_subset2 = data_subset[, selected_pixels]
  
  # perform kernel PCA again
  n <- nrow(data_subset2)
  # compute a median distance as the bandwidth
  dist_matrix <- as.matrix(dist(data_subset2))
  median_dist <- median(dist_matrix[upper.tri(dist_matrix)])
  sigma <- median_dist/2.1
  
  # Compute the kernel matrix
  K <- matrix(0, n, n)
  for (i in 1:n) {
    for (j in 1:n) {
      K[i, j] <- gaussian_kernel(data_subset2[i, ], data_subset2[j, ], sigma)
    }
  }
  
  # Center the kernel matrix
  one_n <- matrix(1/n, n, n)
  K_centered <- K - one_n %*% K - K %*% one_n + one_n %*% K %*% one_n
  # Eigen decomposition
  eig <- eigen(K_centered)
  # Get the top two eigenvectors
  pc1 <- eig$vectors[, 1]
  pc2 <- eig$vectors[, 2]
  
  # Plot the first two kernel principal components
  plot(pc1, pc2, col=zip.train[selected_indices, 1] + 1, pch=19, cex=0.5,
       xlab="First Kernel Principal Component", ylab="Second Kernel Principal Component",
       main="Kernel PCA with Gaussian Kernel (selected pixels)")
  legend("topright", legend=0:9, col=2:11, pch=19, bty="n")

It probably does not improve the result by much, but I think this is a reasonable kernel. It is still positive definite, since it is simply a Gaussian kernel applied to a fixed subset of the features, and restricting a positive definite kernel to a subset of coordinates preserves positive definiteness. Moreover, the selected pixels are the ones that differ most across digit pairs, so they should be more informative than the remaining pixels.

Question 4: MMD [15 pts]

In your previous question, what are the two most difficult digits to separate? You can use the kernel mean embedding and maximum mean discrepancy (MMD) to test whether these two digits are from the same distribution.

  1. [5 pts] Select 100 random images from each of these two digits, and compute the MMD statistic based on a Gaussian kernel. You can use the same bandwidth as in your previous question. Please provide your own code for the MMD statistic.
  2. [5 pts] Use the permutation test to compute the p-value of the MMD statistic. You can use 1000 permutations. Is the p-value small enough to reject the null hypothesis that these two digits are from the same distribution? Please provide a brief summary of your findings.
  3. [5 pts] Use the new kernel that you defined in the previous question to compute the MMD statistic and p-value again. Is the p-value smaller than the one based on the Gaussian kernel? Please provide a brief discussion of your result.

Solution:

# Replace dA/dB below with the two digits found hardest to separate in Q3 (e.g., 3 and 5).
set.seed(546)
library(ElemStatLearn)
data(zip.train)

dA <- 3; dB <- 5   # <-- set to your two hardest digits

A_idx <- which(zip.train[,1] == dA)
B_idx <- which(zip.train[,1] == dB)

A_samp <- sample(A_idx, 100)
B_samp <- sample(B_idx, 100)

X1 <- as.matrix(zip.train[A_samp, -1])  # 100 x p  (digit dA)
X2 <- as.matrix(zip.train[B_samp, -1])  # 100 x p  (digit dB)
# bandwidth: reuse your Q3 sigma; if not in scope, recompute from X1 and X2
if (!exists("sigma")) {
  XY <- rbind(X1, X2)
  D  <- as.matrix(dist(XY))
  sigma <- median(D[upper.tri(D)])
}

# fast Gaussian Gram blocks
gk_blocks <- function(X, Y, sigma) {
  A2 <- rowSums(X^2); B2 <- rowSums(Y^2)
  Kxx <- exp(-(outer(A2, A2, "+") - 2 * (X %*% t(X))) / (2*sigma^2))
  Kyy <- exp(-(outer(B2, B2, "+") - 2 * (Y %*% t(Y))) / (2*sigma^2))
  Kxy <- exp(-(outer(A2, B2, "+") - 2 * (X %*% t(Y))) / (2*sigma^2))
  diag(Kxx) <- 0; diag(Kyy) <- 0  # U-stat: remove diagonals
  list(Kxx=Kxx, Kyy=Kyy, Kxy=Kxy)
}

mmd2_unbiased <- function(X, Y, sigma) {
  n <- nrow(X); m <- nrow(Y)
  Ks <- gk_blocks(X, Y, sigma)
  term_xx <- sum(Ks$Kxx) / (n*(n-1))
  term_yy <- sum(Ks$Kyy) / (m*(m-1))
  term_xy <- mean(Ks$Kxy)
  term_xx + term_yy - 2*term_xy
}

mmd2_obs_gauss <- mmd2_unbiased(X1, X2, sigma)
mmd2_obs_gauss
## [1] 0.002717003
set.seed(546)
B <- 1000
XY <- rbind(X1, X2)
n <- nrow(X1); m <- nrow(X2)
perm_vals <- numeric(B)

for (b in seq_len(B)) {
  idx <- sample.int(n + m)
  Xp <- XY[idx[1:n], , drop=FALSE]
  Yp <- XY[idx[(n+1):(n+m)], , drop=FALSE]
  perm_vals[b] <- mmd2_unbiased(Xp, Yp, sigma)
}

pval_gauss <- mean(perm_vals >= mmd2_obs_gauss)
c(MMD2 = mmd2_obs_gauss, p_value = pval_gauss)
##        MMD2     p_value 
## 0.002717003 0.000000000

Since the permutation p-value is below 0.05 (none of the 1000 permuted statistics reached the observed MMD statistic), we reject the null hypothesis that the two digits come from the same distribution.
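
A slightly more conservative convention (optional; pval_gauss_add1 is just an illustrative name) counts the observed statistic as one of the permutations, so the smallest attainable p-value is \(1/(B+1)\) rather than exactly zero:

# add-one permutation p-value (never exactly zero)
pval_gauss_add1 <- (1 + sum(perm_vals >= mmd2_obs_gauss)) / (B + 1)
pval_gauss_add1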

# selected_pixels: the column indices you constructed in Q3
# (must be defined; otherwise set selected_pixels <- 1:ncol(X1) to use all)
X1s <- X1[, selected_pixels, drop=FALSE]
X2s <- X2[, selected_pixels, drop=FALSE]

# if you changed bandwidth for  the selected-pixel kernel in Q3, reuse it here:
sigma_new <- if (exists("sigma_new")) sigma_new else sigma

mmd2_obs_new <- mmd2_unbiased(X1s, X2s, sigma_new)

set.seed(546)
perm_vals_new <- numeric(B)
XYs <- rbind(X1s, X2s)
for (b in seq_len(B)) {
  idx <- sample.int(n + m)
  Xp <- XYs[idx[1:n], , drop=FALSE]
  Yp <- XYs[idx[(n+1):(n+m)], , drop=FALSE]
  perm_vals_new[b] <- mmd2_unbiased(Xp, Yp, sigma_new)
}
pval_new <- mean(perm_vals_new >= mmd2_obs_new)
c(MMD2_new = mmd2_obs_new, p_value_new = pval_new)
##    MMD2_new p_value_new 
##   0.1796489   0.0000000

Takeaway: we compare the two p-values. Here both permutation p-values are essentially zero (no permuted statistic reached the observed value), so both kernels clearly reject the null hypothesis, and the p-value alone does not distinguish them. In general, if the new kernel is better aligned with the digit differences (characteristic, or simply more discriminating for that structure), we would usually see a smaller p-value than with the Gaussian kernel; if it is mismatched, the p-value can be larger.