Instructions

Students are encouraged to work together on homework. However, sharing, copying, or providing any part of a homework solution or code is an infraction of the University’s rules on Academic Integrity. Any violation will be punished as severely as possible. Final submissions must be uploaded to Gradescope. No email or hard copy will be accepted. For late submission policy and grading rubrics, please refer to the course website.

Question 1: Kernel Functions [10 pts]

  1. In our lecture, we showed that the kernel function

\[ k(x, y) = (x^T y)^2 \]

achieves the feature mapping that contains all second order interaction terms. Use this to show that the kernel function

\[ k(x, y) = (x^T y + c)^2 \]

where \(c > 0\) is a constant, achieves the feature mapping that contains all second order terms plus all first order terms and an intercept term.
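
Before deriving the feature map, a quick numeric sanity check of the kernel's binomial expansion can be a useful starting point. The sketch below is illustrative only (not the required derivation); the variable c0 is a placeholder for the constant \(c\).

  # Illustrative sanity check only: numerically verify the expansion of (x'y + c)^2
  set.seed(1)
  x <- rnorm(3); y <- rnorm(3); c0 <- 2          # c0 plays the role of the constant c > 0
  lhs <- (sum(x * y) + c0)^2                     # k(x, y) = (x'y + c)^2
  rhs <- sum(x * y)^2 + 2 * c0 * sum(x * y) + c0^2
  all.equal(lhs, rhs)                            # TRUE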

Question 2: Weighted SVM [80 pts]

In their paper on Outcome Weighted Learning, Zhao et al. (2012) proposed using the following objective function (see Equation 2.2 in the paper) to estimate optimal treatment regimes. This is a weighted support vector machine classification problem.

\[ \min_{f \in \mathcal{H}_K} \frac{1}{n} \sum_{i=1}^n w_i \big(1 - A_i f(x_i)\big)_+ + \lambda ||f||^2_K \]

where \(w_i\) is an observation weight, \(A_i \in \{-1, 1\}\) is a binary outcome, and \(\mathcal{H}_K\) is an RKHS with kernel function \(K\).
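
For intuition only, the sketch below evaluates this objective for a linear decision function \(f(x) = x^T \beta + \beta_0\); the function name and arguments are placeholders and not part of the assignment.

  # Minimal sketch: weighted hinge-loss objective for a linear f(x) = x'beta + beta0
  weighted_svm_obj <- function(beta, beta0, X, A, w, lambda) {
    f <- as.vector(X %*% beta) + beta0
    hinge <- pmax(1 - A * f, 0)            # (1 - A_i f(x_i))_+
    mean(w * hinge) + lambda * sum(beta^2)
  }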

  1. [10 pts] Note that this corresponds to a penalized version of the SVM. Please provide the primal (Lagrangian) form of the corresponding SVM optimization problem with a linear kernel.

  2. [20 pts] Please provide the dual form of the corresponding problem, and derive the final optimization problem based on the dual.

  3. [30 pts] Take the mixture.example dataset in the ElemStatLearn package, and use the following code to define the weights:

  library(ElemStatLearn)
  data(mixture.example)
  
  # the grid on x1 and x2
  px1 = mixture.example$px1
  px2 = mixture.example$px2
  
  # plot the data and true decision boundary
  prob <- mixture.example$prob
  prob.bayes <- matrix(prob, 
                       length(px1), 
                       length(px2))
  contour(px1, px2, prob.bayes, levels=0.5, lty=2, 
          labels="", xlab="x1",ylab="x2",
          main="SVM with linear kernal", col = "red", lwd = 2)
  points(mixture.example$x, col=ifelse(mixture.example$y==1, "darkorange", "deepskyblue"), pch = 19)

  # define the new weights
  w = rep(1, nrow(mixture.example$x))
  w[mixture.example$x[,1] < 0.8 & mixture.example$x[,2] > 1 & mixture.example$y == 0] = 10  

  # plot the data and visualize the ones with large weights
  plot(mixture.example$x, col = ifelse(mixture.example$y == 1, "darkorange", "deepskyblue"), cex = 0.5, pch = 19)
  points(mixture.example$x[w == 10, ], col = "darkgreen", pch = 19)

Explore the quadprog package in R to solve the dual problem you derived in part (b). For this question, change the inner product \(x_i^T x_j\) to a polynomial kernel with degree 4, and make sure to add an intercept term before applying the kernel function. The solve.QP() function in this package is specified as solve.QP(Dmat, dvec, Amat, bvec, meq = 0), where Dmat and dvec define the quadratic objective \(\min_b \; -d^T b + \tfrac{1}{2} b^T D b\), Amat and bvec define the constraints \(A^T b \ge b_0\), and meq specifies how many of the leading constraints are treated as equalities.
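
For reference, here is a minimal sketch of the kernel construction and the solve.QP() call pattern. The exact Dmat, Amat, bvec, and meq depend on the dual you derive in part (b), so the commented lines are placeholders rather than the solution, and the small ridge added to Dmat is only a numerical-stability assumption.

  library(quadprog)
  library(ElemStatLearn)
  data(mixture.example)
  
  X <- cbind(1, mixture.example$x)     # add the intercept term before applying the kernel
  A <- 2 * mixture.example$y - 1       # recode 0/1 labels to -1/+1
  K <- (X %*% t(X))^4                  # polynomial kernel with degree 4
  n <- nrow(X)
  
  # solve.QP() minimizes -d'b + (1/2) b'Db subject to t(Amat) %*% b >= bvec,
  # where the first meq constraints are treated as equalities
  Dmat <- outer(A, A) * K + 1e-8 * diag(n)   # jitter keeps Dmat numerically positive definite
  dvec <- rep(1, n)
  # Amat, bvec, and meq must encode the constraints of your dual (including the weights w):
  # sol   <- solve.QP(Dmat, dvec, Amat, bvec, meq = 1)
  # alpha <- sol$solution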

For more details, you should read the documentation of the solve.QP() function. Perform the following: solve the dual problem with solve.QP(), recover the fitted decision function, and visualize the resulting decision boundary over the training data.

If you need a reference for where you are heading, here is a plot of my fitted SVM:

  4. [20 pts] Instead of the dual, you can use our understanding of the representer theorem to formulate the primal form. Note that it would simply be the original statement of the optimization problem at the beginning of this question, which is an unconstrained optimization problem. Please provide the form of your finite-dimensional optimization problem. Still use the polynomial kernel with degree 4. You can use any optimization function in R (e.g., optim()) to solve the primal problem directly. In this case, what is your decision function? Visualize the decision boundary again. Is it the same as the one you obtained from the dual form? Please provide a brief discussion of your result.
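
A minimal sketch of one way to set up the primal objective for optim() is shown below. It assumes the kernel matrix K, the recoded labels A, and the weights w from the earlier sketches, and uses the expansion \(f(x_i) = \sum_j \beta_j K(x_i, x_j)\) from the representer theorem; the choice of lambda and the optimizer are placeholders, not required settings.

  # Sketch only: penalized weighted hinge loss under the representer-theorem expansion
  primal_obj <- function(beta, K, A, w, lambda) {
    f <- as.vector(K %*% beta)                              # f(x_i) = sum_j beta_j K(x_i, x_j)
    mean(w * pmax(1 - A * f, 0)) +
      lambda * as.numeric(t(beta) %*% K %*% beta)           # RKHS norm penalty
  }
  
  # fit <- optim(par = rep(0, nrow(K)), fn = primal_obj, K = K, A = A, w = w,
  #              lambda = 0.01, method = "BFGS")
  # beta_hat <- fit$par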

Question 3: Kernel PCA [30 pts]

The following code is for PCA of the handwritten digits dataset in the ElemStatLearn package.

  library(ElemStatLearn)
  data(zip.train)
  
  # PCA
  pca <- prcomp(zip.train[, -1], center=TRUE, scale.=FALSE)
  
  # plot the first two PCs
  plot(pca$x[,1], pca$x[,2], col=zip.train[,1]+1, pch=19, cex = 0.5,
       xlab="First Principal Component", ylab="Second Principal Component")

The goal of this question is to implement kernel PCA by yourself. You can use any matrix decomposition functions in R (e.g., eigen(), svd(), etc.), but you should implement most of the main steps by yourself. Consider the following:

  1. Use a Gaussian kernel with bandwidth \(\sigma\). You have to choose your own bandwidth but it does NOT need to be optimized.
  2. Randomly select 50 images from each digit (0-9) to form a dataset with 500 images.
  3. Write your own code to perform kernel PCA on the dataset, and visualize the first two kernel principal components (a minimal sketch is provided after this list for reference).
  4. Is the result better than the linear PCA? Please provide a brief explanation.
  5. If you are asked to choose a better kernel (possibly define your own kernel function) to improve the result, what would you do? Is your kernel function positive definite? Please provide a brief explanation, and also plot the first two kernel principal components based on your new kernel. Your result does not need to be better than the Gaussian kernel, but you should provide a reasonable explanation of why you think your kernel is good.
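
For item 3 above, here is a minimal sketch of one possible kernel PCA implementation. The random seed, the bandwidth value, and the plotting choices are illustrative assumptions only, not required settings.

  library(ElemStatLearn)
  data(zip.train)
  
  # illustrative subsample: 50 images per digit
  set.seed(1)
  idx <- unlist(lapply(0:9, function(d) sample(which(zip.train[, 1] == d), 50)))
  X <- zip.train[idx, -1]
  digit <- zip.train[idx, 1]
  n <- nrow(X)
  
  # Gaussian kernel matrix with a placeholder bandwidth
  sigma <- 8
  K <- exp(-as.matrix(dist(X))^2 / (2 * sigma^2))
  
  # center the kernel matrix in feature space
  one_n <- matrix(1 / n, n, n)
  Kc <- K - one_n %*% K - K %*% one_n + one_n %*% K %*% one_n
  
  # eigen-decomposition; scale eigenvectors so each principal direction has unit norm
  eig <- eigen(Kc, symmetric = TRUE)
  alpha <- sweep(eig$vectors[, 1:2], 2, sqrt(pmax(eig$values[1:2], 1e-12)), "/")
  scores <- Kc %*% alpha   # projections onto the first two kernel PCs
  
  plot(scores[, 1], scores[, 2], col = digit + 1, pch = 19, cex = 0.5,
       xlab = "First kernel PC", ylab = "Second kernel PC")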

Question 4: MMD [20 pts]

Based on your previous question, which two digits are the most difficult to separate? You can use the kernel mean embedding and maximum mean discrepancy (MMD) to test whether these two digits are from the same distribution.

  1. Select 100 random images from each of these two digits, and compute the MMD statistic based on a Gaussian kernel. You can use the same bandwidth as in your previous question. Please provide your own code for the MMD statistic (see the sketch after this list for one possible structure).
  2. Use the permutation test to compute the p-value of the MMD statistic. You can use 1000 permutations. Is the p-value small enough to reject the null hypothesis that these two digits are from the same distribution? Please provide a brief summary of your findings.
  3. Use the new kernel that you defined in the previous question to compute the MMD statistic and p-value again. Is the p-value smaller than the one based on the Gaussian kernel? Please provide a brief discussion of your result.
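
For item 1 above, here is a minimal sketch of the (biased) squared-MMD statistic and a simple permutation test. The kernel, bandwidth, number of permutations, and all function names are placeholders for your own implementation.

  # Sketch only: Gaussian-kernel squared MMD and a label-permutation test
  gauss_kernel <- function(X, Y, sigma) {
    d2 <- outer(rowSums(X^2), rowSums(Y^2), "+") - 2 * X %*% t(Y)   # pairwise squared distances
    exp(-d2 / (2 * sigma^2))
  }
  
  mmd2 <- function(X, Y, sigma) {
    mean(gauss_kernel(X, X, sigma)) + mean(gauss_kernel(Y, Y, sigma)) -
      2 * mean(gauss_kernel(X, Y, sigma))
  }
  
  mmd_perm_test <- function(X, Y, sigma, B = 1000) {
    obs <- mmd2(X, Y, sigma)
    Z <- rbind(X, Y); n <- nrow(X)
    null_stats <- replicate(B, {
      id <- sample(nrow(Z))
      mmd2(Z[id[1:n], , drop = FALSE], Z[id[-(1:n)], , drop = FALSE], sigma)
    })
    mean(null_stats >= obs)   # permutation p-value
  }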