Students are encouraged to work together on homework. However, sharing, copying, or providing any part of a homework solution or code is an infraction of the University’s rules on Academic Integrity. Any violation will be punished as severely as possible. Final submissions must be uploaded to Gradescope. No email or hard copy will be accepted. For late submission policy and grading rubrics, please refer to the course website.
Name your submission as HWx_yourNetID.pdf, for example, HW01_rqzhu.pdf. Please note that this must be a .pdf file; the .html format cannot be accepted since it may not be readable in Gradescope. All proofs must be typed in LaTeX format. Make all of your R code chunks visible for grading. If you use the .Rmd file as a template, be sure to remove this instruction section. Use R version \(\geq 4.0.0\); this will ensure your random seed generation is the same as everyone else's. Please note that updating the R version may require you to reinstall all of your packages. In the markdown file, set the seed properly so that the results can be reproduced.

In our lecture, we showed that the kernel function
\[ k(x, y) = (x^T y)^2 \]
achieves the feature mapping that contains all second order interaction terms. Use this to show that the kernel function
\[ k(x, y) = (x^T y + 1)^2 \]
achieves a feature mapping that contains all second order terms, all first order terms, and an intercept term.
Solution:
In general, for \(k(x, y) = (x^T y + c)^2\) we can simply define the augmented vector
\[ \psi(x) = (\sqrt{c}, x) \]
for any \(x \in \mathbb{R}^d\). Then the kernel is the homogeneous quadratic kernel applied to \(\psi(x)\), which includes all second order terms, all first order terms, and an intercept term. For the case above we set \(c = 1\). In particular,
\[ (x^\top y + c)^2 \;=\; \big([\sqrt{c};\,x]^\top[\sqrt{c};\,y]\big)^2, \]
so with \(\psi(x)=(\sqrt{c},x)\) and \(\psi(y)=(\sqrt{c},y)\), the kernel is the homogeneous quadratic kernel \(K(\tilde x,\tilde y)=(\tilde x^\top \tilde y)^2\) in the augmented space. Expanding gives
\[ (x^\top y + c)^2 = \underbrace{(x^\top y)^2}_{\text{all }x_ix_jy_iy_j} \;+\; 2c\,\underbrace{x^\top y}_{\text{all linear }x_i y_i} \;+\; c^2\,\underbrace{(1)}_{\text{intercept}}. \]
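As a quick numerical sanity check (a minimal sketch, not required for the solution), we can verify in R that the kernel value equals the inner product of the explicit degree-2 feature maps of the augmented vectors; the helper phi2() below is introduced only for this check.
# check that (u'v + c)^2 equals <phi(u~), phi(v~)> with u~ = (sqrt(c), u), v~ = (sqrt(c), v)
set.seed(1)
d <- 3; cval <- 1
u <- rnorm(d); v <- rnorm(d)
phi2 <- function(z) as.vector(outer(z, z)) # explicit map to all pairwise products z_i * z_j
ut <- c(sqrt(cval), u) # augmented vectors
vt <- c(sqrt(cval), v)
kernel_val <- (sum(u * v) + cval)^2
feature_val <- sum(phi2(ut) * phi2(vt))
all.equal(kernel_val, feature_val) # TRUE up to rounding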
A paper called Outcome Weighted Learning by Zhao et al. (2012) proposed to use the following objective function (see Equation 2.2 in the paper) to estimate optimal treatment regimes. This is a weighted support vector machine classification problem.
\[ \min_{f \in \mathcal{H}_K} \frac{1}{n} \sum_{i=1}^n w_i \big(1 - A_i f(x_i)\big)_+ + \lambda ||f||^2_K \]
where \(w_i\) is some observation weight, \(A_i \in \{-1, 1\}\) is a binary outcome, and \(\mathcal{H}_K\) is an RKHS with kernel function \(K\).
a. [10 pts] Note that this corresponds to a penalized version of the SVM. Please provide the primal form of the Lagrangian for the corresponding SVM optimization problem with a linear kernel.
Solution:
The first term in the objective function is the empirical hinge loss, corresponding to the slack variables \(\xi_i = (1 - A_i f(x_i))_+\), and the second term corresponds to the norm penalty \(\frac{1}{2} \| \boldsymbol \beta \|_2^2\). Therefore, the primal form of the corresponding SVM optimization problem with a linear kernel is
\[ \begin{aligned} \min_{\boldsymbol \beta, \beta_0, \boldsymbol \xi} & \quad \frac{1}{2} \| \boldsymbol \beta \|_2^2 + C \sum_{i=1}^n w_i \xi_i \\ \text{s.t.} & \quad A_i (\boldsymbol \beta^T x_i + \beta_0) \geq 1 - \xi_i, \quad i = 1, \ldots, n \\ & \quad \xi_i \geq 0, \quad i = 1, \ldots, n \end{aligned} \]
The Lagrangian function for this problem is given by
\[ \begin{aligned} L(\boldsymbol \beta, \beta_0, \boldsymbol \xi, \boldsymbol \alpha, \boldsymbol \gamma) = & \frac{1}{2} \| \boldsymbol \beta \|_2^2 + C \sum_{i=1}^n w_i \xi_i - \sum_{i=1}^n \alpha_i [A_i (\boldsymbol \beta^T x_i + \beta_0) - 1 + \xi_i] - \sum_{i=1}^n \gamma_i \xi_i \end{aligned} \]
where \(\boldsymbol \alpha = (\alpha_1, \ldots, \alpha_n)\) and \(\boldsymbol \gamma = (\gamma_1, \ldots, \gamma_n)\) are the Lagrange multipliers associated with the constraints. The primal form is
\[ \min_{\boldsymbol \beta, \beta_0, \boldsymbol \xi} \max_{\boldsymbol \alpha \geq 0, \boldsymbol \gamma \geq 0} \quad L(\boldsymbol \beta, \beta_0, \boldsymbol \xi, \boldsymbol \alpha, \boldsymbol \gamma) \]
The optimization of this problem is almost equivalent to the standard SVM, except that the slack variables \(\xi_i\) are weighted by \(w_i\).
b. [20 pts] Please provide the dual form of the corresponding problem, and derive the final optimization problem based on the dual.
Solution:
To derive the dual form, we first take the partial derivatives of the Lagrangian with respect to the primal variables \(\boldsymbol \beta\), \(\beta_0\), and \(\boldsymbol \xi\), and set them to zero.
\[ \begin{aligned} \frac{\partial L}{\partial \boldsymbol \beta} &= \boldsymbol \beta - \sum_{i=1}^n \alpha_i A_i x_i = 0 &\implies& \boldsymbol \beta = \sum_{i=1}^n \alpha_i A_i x_i \\ \frac{\partial L}{\partial \beta_0} &= -\sum_{i=1}^n \alpha_i A_i = 0 &\implies& \sum_{i=1}^n \alpha_i A_i = 0 \\ \frac{\partial L}{\partial \xi_i} &= C w_i - \alpha_i - \gamma_i = 0 &\implies& \alpha_i + \gamma_i = C w_i \\ \end{aligned} \]
Substituting these conditions back into the Lagrangian, we obtain the dual form:
\[ \begin{aligned} \max_{\boldsymbol \alpha} \quad L_D(\boldsymbol \alpha) = & \sum_{i=1}^n \alpha_i - \frac{1}{2} \sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j A_i A_j x_i^T x_j \\ \text{s.t.} & \quad 0 \leq \alpha_i \leq C w_i, \quad i = 1, \ldots, n \\ & \quad \sum_{i=1}^n \alpha_i A_i = 0 \end{aligned} \]
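For completeness, here is the substitution step: plugging \(\boldsymbol \beta = \sum_i \alpha_i A_i x_i\) into the Lagrangian and using \(\alpha_i + \gamma_i = C w_i\) and \(\sum_i \alpha_i A_i = 0\), the \(\beta_0\) and \(\boldsymbol \xi\) terms vanish,
\[ \begin{aligned} L &= \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j A_i A_j x_i^T x_j + \sum_{i} (C w_i - \alpha_i - \gamma_i) \xi_i - \beta_0 \sum_i \alpha_i A_i + \sum_i \alpha_i - \sum_{i,j} \alpha_i \alpha_j A_i A_j x_i^T x_j \\ &= \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j A_i A_j x_i^T x_j, \end{aligned} \]
which is the dual objective above, to be maximized over \(\boldsymbol \alpha\) subject to the stated constraints.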
c. [30 pts] Take the
mixture.example dataset in the ElemStatLearn
package, and use the following code to define the
weights:
library(ElemStatLearn)
data(mixture.example)
# the grid on x1 and x2
px1 = mixture.example$px1
px2 = mixture.example$px2
# plot the data and true decision boundary
prob <- mixture.example$prob
prob.bayes <- matrix(prob,
length(px1),
length(px2))
contour(px1, px2, prob.bayes, levels=0.5, lty=2,
labels="", xlab="x1",ylab="x2",
main="True Boundary", col = "red", lwd = 2)
points(mixture.example$x, col=ifelse(mixture.example$y==1, "darkorange", "deepskyblue"), pch = 19)
# define the new weights
w = rep(1, nrow(mixture.example$x))
w[mixture.example$x[,1] < 0.8 & mixture.example$x[,2] > 1 & mixture.example$y == 0] = 10
# plot the data and visualized the ones with large weights
plot(mixture.example$x, col = ifelse(mixture.example$y == 1, "darkorange", "deepskyblue"),
cex = 0.5, pch = 19)
points(mixture.example$x[w == 10, ], col = "darkgreen", pch = 19)
Explore the quadprog
package in R to solve the dual problem you derived in part
(b). For this question, change the inner product \(x_i^T x_j\) to a polynomial kernel with
degree 4, and make sure to add an intercept term before applying the
kernel function. The solve.QP() function in this package
is specified as solve.QP(Dmat, dvec, Amat, bvec, meq = 0),
where
Dmat is the
quadratic coefficient matrix. The matrix must be positive definite. You
can add a small value (e.g., 1e-5) to the diagonal of \(D\) to avoid numerical
errors.
dvec is the linear
term vector. The optimization problem is defined as \(\min \frac{1}{2} {\boldsymbol \alpha}^T D
{\boldsymbol \alpha} - d^T \boldsymbol
\alpha\).
Amat and
bvec define the constraints such that \(A^T \alpha \geq b\). Note that the
constraints are defined in terms of \(A^T\) instead of \(A\).
meq is the number of equality constraints, so that the first meq constraints in \(A^T \alpha \geq b\) are treated as equality constraints; the rest are treated as one-sided inequality constraints. A small toy example illustrating these argument conventions is given below.
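For concreteness, here is a minimal toy example (not part of the required solution) showing the argument conventions: minimize \(\frac{1}{2}(x_1^2 + x_2^2) - x_1 - x_2\) subject to \(x_1 + x_2 = 1\) and \(x_1, x_2 \geq 0\), whose solution is \((0.5, 0.5)\).
library(quadprog)
# objective (1/2) x^T D x - d^T x with D = identity and d = (1, 1)
Dmat <- diag(2)
dvec <- c(1, 1)
# columns of Amat are the constraints, read as A^T x >= bvec:
# first column: x1 + x2 = 1 (an equality because meq = 1)
# next two columns: x1 >= 0 and x2 >= 0
Amat <- cbind(c(1, 1), diag(2))
bvec <- c(1, 0, 0)
toy <- solve.QP(Dmat, dvec, Amat, bvec, meq = 1)
toy$solution # 0.5 0.5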
For more details, you should read the documentation
of the solve.QP() function. Perform the following steps:
Investigate your dual form in the previous question and consider the vector \(\boldsymbol \alpha\) as the parameter to be optimized, then re-write the dual form (and its constraints) as linear and quadratic terms of \(\boldsymbol \alpha\) with one-sided inequality and exact equality constraints, so that they can be represented in the form required by the solve.QP() function.
Solve the dual problem using the
solve.QP() function, and visualize the decision boundary.
You may use the previous code chunk to visualize the fitted decision
boundary by defining your own prob.bayes matrix. The
contour() function will help you visualize the decision
boundary.
You will need to compute the intercept term \(b_0\) in the decision function \(f(x) = \sum_{i=1}^n \alpha_i y_i K(x_i, x) + b_0\). Although we did not discuss how to compute \(b_0\) in the lecture, the result actually follows from the KKT conditions, with the simple formula \[ b_0 = \text{median}_{i \in \mathcal{S}} \left( y_i - \sum_{j=1}^n \alpha_j y_j K(x_j, x_i) \right) \] where \(\mathcal{S}\) is the set of indices of support vectors that are on the margin, i.e., \(0 < \alpha_i < C w_i\).
It should be expected that you will get a different decision boundary compared to the truth, but is it reasonable? Please provide a brief explanation.
Solution:
# ------------------------------------------------------------
# Weighted kernel SVM via dual (quadprog) on ElemStatLearn::mixture.example
# Kernel: polynomial degree 4 with intercept term: K(x,z) = (1 + x^T z)^4
# Dual:
# minimize (1/2) alpha^T D alpha - 1^T alpha
# subject to:
# y^T alpha = 0
# 0 <= alpha_i <= C * w_i
# where D = (y y^T) ⊙ K
# ------------------------------------------------------------
library(ElemStatLearn)
library(quadprog)
# --- Data and Bayes boundary (for reference plot) ---
data(mixture.example)
X <- mixture.example$x # n x 2
y01 <- mixture.example$y # 0/1 labels
y <- ifelse(y01 == 1, 1, -1) # convert to {-1, +1}
n <- nrow(X)
px1 <- mixture.example$px1
px2 <- mixture.example$px2
prob <- mixture.example$prob
prob.bayes <- matrix(prob, length(px1), length(px2))
# --- Define instance weights (your rule) ---
w <- rep(1, n)
w[X[,1] < 0.8 & X[,2] > 1 & y01 == 0] <- 10
# --- Hyperparameters ---
Cval <- 1.0 # C parameter (feel free to adjust)
eps <- 1e-8 # tolerance for support vectors
ridge <- 1e-5 # tiny diagonal add for numerical PD
# --- Degree-4 polynomial kernel with intercept term ---
# Equivalent to augmenting x with a leading 1 before the dot-product.
kpoly_4 <- function(A, B) {
(1 + A %*% t(B))^4
}
# --- Build kernel matrix and D = (y y^T) ⊙ K ---
K <- kpoly_4(X, X)
Youter <- outer(y, y) # y y^T
D <- K * Youter
D <- D + diag(ridge, n) # ensure positive-definiteness for QP
# --- QP components for quadprog::solve.QP
# We solve: min (1/2) alpha^T D alpha - d^T alpha
# with constraints A^T alpha >= b; first meq constraints are equalities.
dvec <- rep(1, n)
# Constraints:
# 1) Equality: y^T alpha = 0 -> Aeq = y, beq = 0
Aeq <- matrix(y, nrow = n, ncol = 1) # n x 1
beq <- 0
# 2) Inequalities for box constraints 0 <= alpha_i <= C * w_i
# alpha >= 0 -> I^T alpha >= 0 (quadprog uses A^T alpha >= b)
# -alpha >= -C w -> (-I)^T alpha >= -C w i.e., alpha <= C w
A_ge0 <- diag(n) # n x n
b_ge0 <- rep(0, n)
A_leCw <- -diag(n) # n x n
b_leCw <- -Cval * w
# Combine: Amat is n x (1 + 2n); bvec length 1 + 2n
Amat <- cbind(Aeq, A_ge0, A_leCw) # (n x (1+2n))
bvec <- c(beq, b_ge0, b_leCw)
# Number of equalities:
meq <- 1
# --- Solve QP ---
qp_fit <- solve.QP(Dmat = D, dvec = dvec, Amat = Amat, bvec = bvec, meq = meq)
alpha <- qp_fit$solution
# --- Identify support vectors and compute intercept b0
at_lower <- alpha < eps
at_upper <- (Cval * w - alpha) < eps
is_SV_margin <- (!at_lower) & (!at_upper) # 0 < alpha_i < C w_i
# Decision function without intercept: f0(z) = sum_i alpha_i y_i K(x_i, z)
f0 <- function(Z) {
Kxz <- kpoly_4(X, Z) # n x m
as.numeric(t(alpha * y) %*% Kxz) # 1 x m -> numeric vector
}
# Intercept via KKT on margin SVs; if none, use all alpha_i > eps as fallback
idx <- which(is_SV_margin)
if (length(idx) == 0L) idx <- which(alpha > eps)
# b0 = median over chosen SVs of [ y_i - sum_j alpha_j y_j K(x_j, x_i) ]
b_candidates <- y[idx] - f0(X[idx, , drop = FALSE])
b0 <- median(b_candidates)
# Final decision function
f <- function(Z) f0(Z) + b0
pred <- function(Z) ifelse(f(Z) >= 0, 1, -1)
# --- Visualization: decision boundary on provided grid (level = 0) ---
# The mixture.example grid order matches matrix(prob, length(px1), length(px2)).
# We'll compute f over the same grid ordering and reshape.
grid <- expand.grid(px1, px2) # rows vary px1 fastest by default
f_grid <- f(as.matrix(grid))
f_mat <- matrix(f_grid, nrow = length(px1), ncol = length(px2))
# --- Plots: Bayes boundary (dashed) and learned boundary (solid) ---
par(mfrow = c(1, 1))
contour(px1, px2, prob.bayes, levels = 0.5, lty = 2,
labels = "", xlab = "x1", ylab = "x2",
main = "Weighted SVM (poly-4 kernel) vs Bayes boundary",
col = "red", lwd = 2)
points(X, col = ifelse(y01 == 1, "darkorange", "deepskyblue"), pch = 19)
# our learned boundary at f=0
contour(px1, px2, f_mat, levels = 0, add = TRUE, lwd = 2)
# Optionally highlight heavy-weighted points
points(X[w == max(w), , drop = FALSE], pch = 1, cex = 1.2, lwd = 2)
legend("bottomleft",
legend = c("Bayes (0.5)", "SVM boundary", "class 1", "class 0", "high weight"),
col = c("red", "black", "darkorange", "deepskyblue", "black"),
lty = c(2, 1, NA, NA, NA),
pch = c(NA, NA, 19, 19, 1),
lwd = c(2, 2, NA, NA, 2),
bty = "n")
d. [20 pts] Instead, you can use our
understanding of the representer theorem to formulate the primal form.
Note that it would be simply the original statement of the optimization
problem at the beginning of this question, which is an unconstrained
optimization problem. Please provide the form of your finite dimensional
optimization problem. Still use the polynomial kernel with degree 4. You
can use any optimization functions in R (e.g.,
optim()) to directly solve the primal problem. In this
case, what is your decision function? Visualize the decision boundary
again. Is it the same as the one you obtained from the dual form? Please
provide a brief discussion of your result.
Solution: By the representer theorem, the RKHS minimizer can be written as
\[ f(x)=\sum_{j=1}^n \beta_j K(x,x_j)+b_0, \]
so with our degree-4 polynomial kernel \(K(x,z)=(1+x^\top z)^4\), the finite-dimensional problem in \((\beta,b_0)\) becomes \[ \min_{\beta\in\mathbb{R}^n,\;b_0\in\mathbb{R}} \frac{1}{n}\sum_{i=1}^n w_i\bigl(1-y_i[(K\beta)_i+b_0]\bigr)_+ +\lambda\,\beta^\top K\beta \] where \(K\in\mathbb{R}^{n\times n}\) is the Gram matrix with \(K_{ij}=K(x_i,x_j)\) and \((K\beta)_i=\sum_j K_{ij}\beta_j\). To match the usual soft-margin parameter \(C\) used in the dual, we set
\[ \lambda=\frac{1}{2nC} \]
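To see this correspondence, note that at the optimum the slack variables satisfy \(\xi_i = (1 - y_i f(x_i))_+\), so dividing the constrained objective from part (a) by the positive constant \(nC\) gives
\[ \frac{1}{nC}\left(\frac{1}{2}\|f\|^2_K + C\sum_{i=1}^n w_i \xi_i\right) = \frac{1}{n}\sum_{i=1}^n w_i\bigl(1 - y_i f(x_i)\bigr)_+ + \frac{1}{2nC}\|f\|^2_K, \]
which is exactly the penalized form with \(\lambda = \frac{1}{2nC}\); rescaling the objective does not change the minimizer.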
In the code, \(K\) is built via \(K(X,X)=(1+X X^\top)^4\), the weights \(w\) are set to 10 for selected class-0 points and 1 otherwise, and we minimize the weighted average hinge loss plus the RKHS penalty with L-BFGS-B. For numerical stability (mirroring the dual’s tiny ridge on \(D\)), we optionally use \(K_r=K+\delta I\) with \(\delta=10^{-5}\) only in the penalty:
\[ \text{obj}(\beta,b_0)=\frac{1}{n}\sum_i w_i\bigl(1-y_i[(K\beta)_i+b_0]\bigr)_+ + \lambda\,\beta^\top K_r\beta \]
The fitted decision function used for prediction and plotting is
\[ \hat f(x)=\sum_{j=1}^n \hat\beta_j K(x,x_j)+\hat b_0, \]
implemented as
f_primal(Z) = kpoly4(Z, X) %*% beta_hat + b0_hat. We then
evaluate \(\hat f\) on the plotting
grid \((px1,px2)\) and draw the
decision boundary with contour(..., levels = 0). In the
code, the matrix \(Fmat\) is reshaped
so that its dimensions match contour(px1, px2, Fmat).
This primal solution matches the dual (up to small numerical differences) when:
\(C\) and \(\lambda\) are linked by \(\lambda=1/(2nC)\),
the same kernel and weights are used, and
the grid reshape is consistent with \((px1,px2)\).
# -------------------------------------------------------------------
# Weighted SVM via the representer theorem (primal , poly-4 kernel)
# -------------------------------------------------------------------
X <- mixture.example$x # n × 2 input matrix
y01 <- mixture.example$y # {0,1} labels
y <- ifelse(y01 == 1, +1, -1) # convert to {+1,−1}
n <- nrow(X)
# define weights: upweight class-0 points in the region
w <- rep(1, n)
w[X[,1] < 0.8 & X[,2] > 1 & y01 == 0] <- 10
# Build the degree-4 polynomial kernel with intercept
kpoly4 <- function(A, B) {
  (1 + A %*% t(B))^4
}
K <- kpoly4(X, X) # n × n Gram matrix
# regularization parameter λ
Cval <- 1.0 # same C as used in the dual
lambda <- 1/(2 * n * Cval)
# Define a smooth approximation to the hinge loss and its derivative
smooth_hinge <- function(m, tau = 1e-1) {
# m = y * f; tau = smoothing parameter
u <- 1 - m
loss <- ifelse(u <= 0,0, ifelse(u < tau, u^2/(2*tau), u - tau/2))
dLdm <- ifelse(u <= 0,0, ifelse(u < tau, u/tau, 1))
list(loss = loss, grad = dLdm)
}
# Objective and gradient for optim()
# theta = c(beta (length n), b0 (scalar))
K_r <- K + diag(1e-5, n)  # small ridge on the penalty term for numerical stability
obj_fn <- function(theta) {
  beta <- theta[1:n]
  b0   <- theta[n + 1]
  # f_i = (K %*% beta)_i + b0
  f  <- as.numeric(K %*% beta) + b0
  # smooth hinge on margins m_i = y_i * f_i
  sh <- smooth_hinge(y * f)
  # objective: weighted avg. loss + RKHS penalty
  (1/n) * sum(w * sh$loss) + lambda * as.numeric(t(beta) %*% K_r %*% beta)
}
grad_fn <- function(theta) {
  beta <- theta[1:n]
  b0   <- theta[n + 1]
  f  <- as.numeric(K %*% beta) + b0
  sh <- smooth_hinge(y * f)
  # d loss / d f_i = -(w_i / n) * y_i * L'(u_i)
  dL_df <- - (y * sh$grad) * w / n
  # chain rule: loss gradient through K, plus penalty gradient
  grad_beta <- 2 * lambda * (K_r %*% beta) + (K %*% dL_df)
  grad_b0   <- sum(dL_df)
  c(as.numeric(grad_beta), grad_b0)
}
# Run L-BFGS-B, warm-started from the dual solution of part (c)
# (use rep(0, n + 1) instead if the dual fit is not available)
init_theta <- c(alpha * y, b0)
fit <- optim(init_theta,
             fn = obj_fn,
             gr = grad_fn, # analytic gradient passed explicitly; optim() does not read attributes on fn's value
             method = "L-BFGS-B",
             control = list(maxit = 1000, factr = 1e7, pgtol = 1e-5))
beta_hat <- fit$par[1:n]
b0_hat <- fit$par[n+1]
# Define decision function
f_primal <- function(Z) {
# Z: m×2 matrix of new points
as.numeric(kpoly4(Z, X) %*% beta_hat + b0_hat)
}
# Plot the decision boundary vs. true Bayes boundary
px1 <- mixture.example$px1
px2 <- mixture.example$px2
prob.bayes <- matrix(mixture.example$prob,length(px1), length(px2))
# compute f on grid
grid <- expand.grid(px1, px2)
fvals <- f_primal(as.matrix(grid))
Fmat <- matrix(fvals, nrow=length(px1),ncol=length(px2),byrow=FALSE)
# plot
contour(px1, px2, prob.bayes, levels = 0.5, lty = 2,
col = "red", lwd = 2,
xlab = "x1", ylab = "x2",
main = "Primal Weighted SVM (poly-4 kernel)")
points(X, col = ifelse(y01 == 1, "darkorange", "deepskyblue"), pch = 19)
contour(px1, px2, Fmat, levels = 0, add = TRUE, lwd = 2)
# highlight the heavily weighted points, as in the dual-form plot
points(X[w == max(w), , drop = FALSE], pch = 1, cex = 1.2, lwd = 2)
legend("bottomleft",
       legend = c("Bayes (0.5)", "Primal SVM boundary", "class 1", "class 0", "high weight"),
       col = c("red", "black", "darkorange", "deepskyblue", "black"),
       lty = c(2, 1, NA, NA, NA),
       pch = c(NA, NA, 19, 19, 1),
       lwd = c(2, 2, NA, NA, 2),
       bty = "n")
The following code is for PCA of the handwritten digits dataset in the ElemStatLearn package.
library(ElemStatLearn)
data(zip.train)
# PCA
pca <- prcomp(zip.train[, -1], center=TRUE, scale.=FALSE)
# plot the first two PCs
plot(pca$x[,1], pca$x[,2], col=zip.train[,1]+1, pch=19, cex = 0.5,
xlab="First Principal Component", ylab="Second Principal Component")
The goal of this question is to implement kernel PCA by yourself. You can use any matrix decomposition functions in R (e.g., eigen(), svd(), etc.), but you should implement most of the main steps by yourself. Consider the following:
Solution:
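Briefly, the steps implemented below are: compute the Gram matrix \(K_{ij} = k(x_i, x_j)\) with a Gaussian kernel, center it in feature space, take its eigendecomposition, and use the leading eigenvectors as the kernel principal component scores. The centering used in the code is
\[ \tilde K = K - \mathbf{1}_n K - K \mathbf{1}_n + \mathbf{1}_n K \mathbf{1}_n, \]
where \(\mathbf{1}_n\) denotes the \(n \times n\) matrix with all entries equal to \(1/n\) (the one_n matrix in the code).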
library(ElemStatLearn)
set.seed(123)
data(zip.train)
# randomly select 50 images from each digit
selected_indices <- unlist(lapply(0:9, function(d) {
sample(which(zip.train[, 1] == d), 50)
}))
data_subset <- zip.train[selected_indices, -1]
n <- nrow(data_subset)
# Gaussian kernel function
gaussian_kernel <- function(x, y, sigma) {
exp(-sum((x - y)^2) / (2 * sigma^2))
}
# compute a median distance as the bandwidth
dist_matrix <- as.matrix(dist(data_subset))
median_dist <- median(dist_matrix[upper.tri(dist_matrix)])
sigma <- median_dist
# Compute the kernel matrix
K <- matrix(0, n, n)
for (i in 1:n) {
for (j in 1:n) {
K[i, j] <- gaussian_kernel(data_subset[i, ], data_subset[j, ], sigma)
}
}
# Center the kernel matrix
one_n <- matrix(1/n, n, n)
K_centered <- K - one_n %*% K - K %*% one_n + one_n %*% K %*% one_n
# Eigen decomposition
eig <- eigen(K_centered)
# Get the top two eigenvectors
pc1 <- eig$vectors[, 1]
pc2 <- eig$vectors[, 2]
# Plot the first two kernel principal components
plot(pc1, pc2, col=zip.train[selected_indices, 1] + 1, pch=19, cex=0.5,
xlab="First Kernel Principal Component", ylab="Second Kernel Principal Component",
main="Kernel PCA with Gaussian Kernel")
legend("topright", legend=0:9, col=2:11, pch=19, bty="n")
This result is not really better than the linear PCA. The reason could be that the Gaussian kernel can be affected by the dimensionality of the data. The images are 256-dimensional, and the Gaussian kernel may not effectively capture the structure in such a high-dimensional space. I am going to cheat a bit by selecting pixels adaptively: for each pair of digits, I find the pixel where the two groups differ the most, then keep the unique set of selected pixels.
# for each pair of digits, find the pixel that is most different
pixel_select = matrix(NA, nrow=10, ncol=10)
for (i in 1:10) {
for (j in 1:10) {
if (i < j)
pixel_select[i, j] = which.max(apply(data_subset[zip.train[selected_indices, 1] == (i-1), ] -
data_subset[zip.train[selected_indices, 1] == (j-1), ], 2, var))
}
}
# get all pixels that are selected
selected_pixels = unique(as.vector(pixel_select))
selected_pixels = selected_pixels[!is.na(selected_pixels)]
data_subset2 = data_subset[, selected_pixels]
# perform kernel PCA again
n <- nrow(data_subset2)
# compute a median distance as the bandwidth
dist_matrix <- as.matrix(dist(data_subset2))
median_dist <- median(dist_matrix[upper.tri(dist_matrix)])
sigma <- median_dist/2.1
# Compute the kernel matrix
K <- matrix(0, n, n)
for (i in 1:n) {
for (j in 1:n) {
K[i, j] <- gaussian_kernel(data_subset2[i, ], data_subset2[j, ], sigma)
}
}
# Center the kernel matrix
one_n <- matrix(1/n, n, n)
K_centered <- K - one_n %*% K - K %*% one_n + one_n %*% K %*% one_n
# Eigen decomposition
eig <- eigen(K_centered)
# Get the top two eigenvectors
pc1 <- eig$vectors[, 1]
pc2 <- eig$vectors[, 2]
# Plot the first two kernel principal components
plot(pc1, pc2, col=zip.train[selected_indices, 1] + 1, pch=19, cex=0.5,
xlab="First Kernel Principal Component", ylab="Second Kernel Principal Component",
main="Kernel PCA with Gaussian Kernel (selected pixels)")
legend("topright", legend=0:9, col=2:11, pch=19, bty="n")
It's probably not improving the result by much, but I think this is a reasonable kernel. The reason is that the kernel is still positive definite, since it is still a Gaussian kernel, just applied to a smaller set of features. And the selected pixels are the ones that differ the most across digits, so they should be more informative than the other pixels.
In your previous question, which two digits were the most difficult to separate? You can use the kernel mean embedding and maximum mean discrepancy (MMD) to test whether these two digits are from the same distribution.
Solution:
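The test statistic computed below (mmd2_unbiased()) is the unbiased, U-statistic estimate of the squared MMD with a Gaussian kernel \(k\),
\[ \widehat{\text{MMD}}^2_u = \frac{1}{n(n-1)}\sum_{i \neq j} k(x_i, x_j) + \frac{1}{m(m-1)}\sum_{i \neq j} k(y_i, y_j) - \frac{2}{nm}\sum_{i=1}^n \sum_{j=1}^m k(x_i, y_j), \]
and its null distribution is approximated by a permutation test on the pooled sample.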
# Replace dA/dB below with the two digits you found hardest to separate in the previous question (e.g., 3 and 5).
set.seed(546)
library(ElemStatLearn)
data(zip.train)
dA <- 3; dB <- 5 # <-- set to your two hardest digits
A_idx <- which(zip.train[,1] == dA)
B_idx <- which(zip.train[,1] == dB)
A_samp <- sample(A_idx, 100)
B_samp <- sample(B_idx, 100)
X1 <- as.matrix(zip.train[A_samp, -1]) # 100 x p (digit dA)
X2 <- as.matrix(zip.train[B_samp, -1]) # 100 x p (digit dB)
# bandwidth: reuse the sigma from the previous question; if not in scope, recompute from X1 and X2
if (!exists("sigma")) {
XY <- rbind(X1, X2)
D <- as.matrix(dist(XY))
sigma <- median(D[upper.tri(D)])
}
# fast Gaussian Gram blocks
gk_blocks <- function(X, Y, sigma) {
A2 <- rowSums(X^2); B2 <- rowSums(Y^2)
Kxx <- exp(-(outer(A2, A2, "+") - 2 * (X %*% t(X))) / (2*sigma^2))
Kyy <- exp(-(outer(B2, B2, "+") - 2 * (Y %*% t(Y))) / (2*sigma^2))
Kxy <- exp(-(outer(A2, B2, "+") - 2 * (X %*% t(Y))) / (2*sigma^2))
diag(Kxx) <- 0; diag(Kyy) <- 0 # U-stat: remove diagonals
list(Kxx=Kxx, Kyy=Kyy, Kxy=Kxy)
}
mmd2_unbiased <- function(X, Y, sigma) {
n <- nrow(X); m <- nrow(Y)
Ks <- gk_blocks(X, Y, sigma)
term_xx <- sum(Ks$Kxx) / (n*(n-1))
term_yy <- sum(Ks$Kyy) / (m*(m-1))
term_xy <- mean(Ks$Kxy)
term_xx + term_yy - 2*term_xy
}
mmd2_obs_gauss <- mmd2_unbiased(X1, X2, sigma)
mmd2_obs_gauss
## [1] 0.002717003
set.seed(546)
B <- 1000
XY <- rbind(X1, X2)
n <- nrow(X1); m <- nrow(X2)
perm_vals <- numeric(B)
for (b in seq_len(B)) {
idx <- sample.int(n + m)
Xp <- XY[idx[1:n], , drop=FALSE]
Yp <- XY[idx[(n+1):(n+m)], , drop=FALSE]
perm_vals[b] <- mmd2_unbiased(Xp, Yp, sigma)
}
pval_gauss <- mean(perm_vals >= mmd2_obs_gauss)
c(MMD2 = mmd2_obs_gauss, p_value = pval_gauss)
## MMD2 p_value
## 0.002717003 0.000000000
Since the p-value is \(\le 0.05\), we reject the null hypothesis that the two digits come from the same distribution.
# selected_pixels: the column indices you constructed in Q3
# (must be defined; otherwise set selected_pixels <- 1:ncol(X1) to use all)
X1s <- X1[, selected_pixels, drop=FALSE]
X2s <- X2[, selected_pixels, drop=FALSE]
# if you changed bandwidth for the selected-pixel kernel in Q3, reuse it here:
sigma_new <- if (exists("sigma_new")) sigma_new else sigma
mmd2_obs_new <- mmd2_unbiased(X1s, X2s, sigma_new)
set.seed(546)
perm_vals_new <- numeric(B)
XYs <- rbind(X1s, X2s)
for (b in seq_len(B)) {
idx <- sample.int(n + m)
Xp <- XYs[idx[1:n], , drop=FALSE]
Yp <- XYs[idx[(n+1):(n+m)], , drop=FALSE]
perm_vals_new[b] <- mmd2_unbiased(Xp, Yp, sigma_new)
}
pval_new <- mean(perm_vals_new >= mmd2_obs_new)
c(MMD2_new = mmd2_obs_new, p_value_new = pval_new)
## MMD2_new p_value_new
## 0.1796489 0.0000000
Takeaway: we compare the p-values. If the new kernel is better aligned with the digit differences (characteristic, or simply more discriminating for that structure), we will usually see a smaller p-value than with the original Gaussian kernel; if it is mismatched, the p-value can be larger.