Instructions

Students are encouraged to work together on homework. However, sharing, copying, or providing any part of a homework solution or code is an infraction of the University’s rules on Academic Integrity. Any violation will be punished as severely as possible. Final submissions must be uploaded to Gradescope. No email or hard copy will be accepted. For late submission policy and grading rubrics, please refer to the course website.

Question 1: Kernel Functions [10 pts]

  1. In our lecture, we showed that the kernel function

\[ k(x, y) = (x^T y)^2 \]

achieves the feature mapping that contains all second order interaction terms. Use this to show that the kernel function

\[ k(x, y) = (x^T y + c)^2 \]

where \(c > 0\) is a constant, achieves the feature mapping that contains all second order terms plus all first order terms and an intercept term.
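
Before deriving the feature map, a quick numeric sanity check of the kernel's binomial expansion can be a useful starting point. The sketch below is illustrative only (not the required derivation); the variable c0 is a placeholder for the constant \(c\).

  # Illustrative sanity check only: numerically verify the expansion of (x'y + c)^2
  set.seed(1)
  x <- rnorm(3); y <- rnorm(3); c0 <- 2          # c0 plays the role of the constant c > 0
  lhs <- (sum(x * y) + c0)^2                     # k(x, y) = (x'y + c)^2
  rhs <- sum(x * y)^2 + 2 * c0 * sum(x * y) + c0^2
  all.equal(lhs, rhs)                            # TRUE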

Question 2: Weighted SVM [80 pts]

In their paper on Outcome Weighted Learning, Zhao et al. (2012) proposed using the following objective function (see Equation 2.2 in the paper) to estimate optimal treatment regimes. This is a weighted support vector machine classification problem.

\[ \min_{f \in \mathcal{H}_K} \frac{1}{n} \sum_{i=1}^n w_i \big(1 - A_i f(x_i)\big)_+ + \lambda ||f||^2_K \]

where \(w_i\) is an observation weight, \(A_i \in \{-1, 1\}\) is a binary outcome, and \(\mathcal{H}_K\) is an RKHS with kernel function \(K\).
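
For intuition only, the sketch below evaluates this objective for a linear decision function \(f(x) = x^T \beta + \beta_0\); the function name and arguments are placeholders and not part of the assignment.

  # Minimal sketch: weighted hinge-loss objective for a linear f(x) = x'beta + beta0
  weighted_svm_obj <- function(beta, beta0, X, A, w, lambda) {
    f <- as.vector(X %*% beta) + beta0
    hinge <- pmax(1 - A * f, 0)            # (1 - A_i f(x_i))_+
    mean(w * hinge) + lambda * sum(beta^2)
  }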

  1. [10 pts] Note that this corresponds to a penalized version of the SVM. Please provide the primal (Lagrangian) form of the corresponding SVM optimization problem with a linear kernel.

  2. [20 pts] Please provide the dual form of the corresponding problem, and derive the final optimization problem based on the dual.

  3. [30 pts] Take the mixture.example dataset in the ElemStatLearn package, and use the following code to define the weights:

  library(ElemStatLearn)
  data(mixture.example)
  
  # the grid on x1 and x2
  px1 = mixture.example$px1
  px2 = mixture.example$px2
  
  # plot the data and true decision boundary
  prob <- mixture.example$prob
  prob.bayes <- matrix(prob, 
                       length(px1), 
                       length(px2))
  contour(px1, px2, prob.bayes, levels=0.5, lty=2, 
          labels="", xlab="x1",ylab="x2",
          main="SVM with linear kernal", col = "red", lwd = 2)
  points(mixture.example$x, col=ifelse(mixture.example$y==1, "darkorange", "deepskyblue"), pch = 19)

  # define the new weights
  w = rep(1, nrow(mixture.example$x))
  w[mixture.example$x[,1] < 0.8 & mixture.example$x[,2] > 1 & mixture.example$y == 0] = 10  

  # plot the data and visualize the ones with large weights
  plot(mixture.example$x, col = ifelse(mixture.example$y == 1, "darkorange", "deepskyblue"), cex = 0.5, pch = 19)
  points(mixture.example$x[w == 10, ], col = "darkgreen", pch = 19)

Explore the quadprog package in R to solve the dual problem you derived in part (b). For this question, change the inner product \(x_i^T x_j\) to a polynomial kernel with degree 4, and make sure to add an intercept term before applying the kernel function. The solve.QP() function in this package is specified as solve.QP(Dmat, dvec, Amat, bvec, meq = 0), where Dmat and dvec define the quadratic objective \(\min_b \; -d^T b + \tfrac{1}{2} b^T D b\), Amat and bvec define the constraints \(A^T b \ge b_0\), and meq specifies how many of the leading constraints are treated as equalities.
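
For reference, here is a minimal sketch of the kernel construction and the solve.QP() call pattern. The exact Dmat, Amat, bvec, and meq depend on the dual you derive in part (b), so the commented lines are placeholders rather than the solution, and the small ridge added to Dmat is only a numerical-stability assumption.

  library(quadprog)
  library(ElemStatLearn)
  data(mixture.example)
  
  X <- cbind(1, mixture.example$x)     # add the intercept term before applying the kernel
  A <- 2 * mixture.example$y - 1       # recode 0/1 labels to -1/+1
  K <- (X %*% t(X))^4                  # polynomial kernel with degree 4
  n <- nrow(X)
  
  # solve.QP() minimizes -d'b + (1/2) b'Db subject to t(Amat) %*% b >= bvec,
  # where the first meq constraints are treated as equalities
  Dmat <- outer(A, A) * K + 1e-8 * diag(n)   # jitter keeps Dmat numerically positive definite
  dvec <- rep(1, n)
  # Amat, bvec, and meq must encode the constraints of your dual (including the weights w):
  # sol   <- solve.QP(Dmat, dvec, Amat, bvec, meq = 1)
  # alpha <- sol$solution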

For more details, you should read the documentation of the solve.QP() function. Perform the following: solve the dual problem with solve.QP(), recover the fitted decision function, and visualize the resulting decision boundary over the training data.

If you need a reference for where you are heading, here is a plot of my fitted SVM:

  4. [20 pts] Instead of the dual, you can use our understanding of the representer theorem to formulate the primal form. Note that it would simply be the original statement of the optimization problem at the beginning of this question, which is an unconstrained optimization problem. Please provide the form of your finite-dimensional optimization problem. Still use the polynomial kernel with degree 4. You can use any optimization function in R (e.g., optim()) to solve the primal problem directly. In this case, what is your decision function? Visualize the decision boundary again. Is it the same as the one you obtained from the dual form? Please provide a brief discussion of your result.
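
A minimal sketch of one way to set up the primal objective for optim() is shown below. It assumes the kernel matrix K, the recoded labels A, and the weights w from the earlier sketches, and uses the expansion \(f(x_i) = \sum_j \beta_j K(x_i, x_j)\) from the representer theorem; the choice of lambda and the optimizer are placeholders, not required settings.

  # Sketch only: penalized weighted hinge loss under the representer-theorem expansion
  primal_obj <- function(beta, K, A, w, lambda) {
    f <- as.vector(K %*% beta)                              # f(x_i) = sum_j beta_j K(x_i, x_j)
    mean(w * pmax(1 - A * f, 0)) +
      lambda * as.numeric(t(beta) %*% K %*% beta)           # RKHS norm penalty
  }
  
  # fit <- optim(par = rep(0, nrow(K)), fn = primal_obj, K = K, A = A, w = w,
  #              lambda = 0.01, method = "BFGS")
  # beta_hat <- fit$par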

Question 3: Kernel PCA [30 pts]

The following code is for PCA of the handwritten digits dataset in the ElemStatLearn package.

  library(ElemStatLearn)
  data(zip.train)
  
  # PCA
  pca <- prcomp(zip.train[, -1], center=TRUE, scale.=FALSE)
  
  # plot the first two PCs
  plot(pca$x[,1], pca$x[,2], col=zip.train[,1]+1, pch=19, cex = 0.5,
       xlab="First Principal Component", ylab="Second Principal Component")

The goal of this question is to implement kernel PCA by yourself. You can use any matrix decomposition functions in R (e.g., eigen(), svd(), etc.), but you should implement most of the main steps by yourself. Consider the following:

  1. Use a Gaussian kernel with bandwidth \(\sigma\). You have to choose your own bandwidth but it does NOT need to be optimized.
  2. Randomly select 50 images from each digit (0-9) to form a dataset with 500 images.
  3. Write your own code to perform kernel PCA on the dataset, and visualize the first two kernel principal components (a minimal sketch is provided after this list for reference).
  4. Is the result better than the linear PCA? Please provide a brief explanation.
  5. If you are asked to choose a better kernel (possibly define your own kernel function) to improve the result, what would you do? Is your kernel function positive definite? Please provide a brief explanation, and also plot the first two kernel principal components based on your new kernel. Your result does not need to be better than the Gaussian kernel, but you should provide a reasonable explanation of why you think your kernel is good.
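
For item 3 above, here is a minimal sketch of one possible kernel PCA implementation. The random seed, the bandwidth value, and the plotting choices are illustrative assumptions only, not required settings.

  library(ElemStatLearn)
  data(zip.train)
  
  # illustrative subsample: 50 images per digit
  set.seed(1)
  idx <- unlist(lapply(0:9, function(d) sample(which(zip.train[, 1] == d), 50)))
  X <- zip.train[idx, -1]
  digit <- zip.train[idx, 1]
  n <- nrow(X)
  
  # Gaussian kernel matrix with a placeholder bandwidth
  sigma <- 8
  K <- exp(-as.matrix(dist(X))^2 / (2 * sigma^2))
  
  # center the kernel matrix in feature space
  one_n <- matrix(1 / n, n, n)
  Kc <- K - one_n %*% K - K %*% one_n + one_n %*% K %*% one_n
  
  # eigen-decomposition; scale eigenvectors so each principal direction has unit norm
  eig <- eigen(Kc, symmetric = TRUE)
  alpha <- sweep(eig$vectors[, 1:2], 2, sqrt(pmax(eig$values[1:2], 1e-12)), "/")
  scores <- Kc %*% alpha   # projections onto the first two kernel PCs
  
  plot(scores[, 1], scores[, 2], col = digit + 1, pch = 19, cex = 0.5,
       xlab = "First kernel PC", ylab = "Second kernel PC")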

Question 4: MMD [20 pts]

Based on your previous question, which two digits are the most difficult to separate? You can use the kernel mean embedding and maximum mean discrepancy (MMD) to test whether these two digits are from the same distribution.

  1. Select 100 random images from each of these two digits, and compute the MMD statistic based on a Gaussian kernel. You can use the same bandwidth as in your previous question. Please provide your own code for the MMD statistic (see the sketch after this list for one possible structure).
  2. Use the permutation test to compute the p-value of the MMD statistic. You can use 1000 permutations. Is the p-value small enough to reject the null hypothesis that these two digits are from the same distribution? Please provide a brief summary of your findings.
  3. Use the new kernel that you defined in the previous question to compute the MMD statistic and p-value again. Is the p-value smaller than the one based on the Gaussian kernel? Please provide a brief discussion of your result.
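
For item 1 above, here is a minimal sketch of the (biased) squared-MMD statistic and a simple permutation test. The kernel, bandwidth, number of permutations, and all function names are placeholders for your own implementation.

  # Sketch only: Gaussian-kernel squared MMD and a label-permutation test
  gauss_kernel <- function(X, Y, sigma) {
    d2 <- outer(rowSums(X^2), rowSums(Y^2), "+") - 2 * X %*% t(Y)   # pairwise squared distances
    exp(-d2 / (2 * sigma^2))
  }
  
  mmd2 <- function(X, Y, sigma) {
    mean(gauss_kernel(X, X, sigma)) + mean(gauss_kernel(Y, Y, sigma)) -
      2 * mean(gauss_kernel(X, Y, sigma))
  }
  
  mmd_perm_test <- function(X, Y, sigma, B = 1000) {
    obs <- mmd2(X, Y, sigma)
    Z <- rbind(X, Y); n <- nrow(X)
    null_stats <- replicate(B, {
      id <- sample(nrow(Z))
      mmd2(Z[id[1:n], , drop = FALSE], Z[id[-(1:n)], , drop = FALSE], sigma)
    })
    mean(null_stats >= obs)   # permutation p-value
  }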