Instructions

Students are encouraged to work together on homework. However, sharing, copying, or providing any part of a homework solution or code is an infraction of the University’s rules on Academic Integrity. Any violation will be punished as severely as possible. Final submissions must be uploaded to compass2g. No email or hardcopy will be accepted. For late submission policy and grading rubrics, please refer to the course website.

About HW9

In this HW, we use two examples to demonstrate how RKHS methods and the Representer Theorem are used in practice. In both questions, we will use the radial basis kernel, defined as

\[K(\mathbf{x}, \mathbf{z}) = e^{- \frac{\lVert \mathbf{x} - \mathbf{z}\rVert^2}{2\sigma^2}}.\] You are not required to tune the parameter \(\sigma\).
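
For concreteness, the small helper below computes this kernel matrix between two sets of observations; the name rbf_kernel() is an illustrative choice and not part of the assignment, and the solutions below compute the same quantity inline with dist() and sweep().

# radial basis kernel matrix between the rows of X1 (n1 x p) and X2 (n2 x p)
# rbf_kernel() is an illustrative helper, not required by the assignment
rbf_kernel = function(X1, X2, sigma = 1)
{
  # squared distances via ||x - z||^2 = ||x||^2 + ||z||^2 - 2 * x'z
  d2 = outer(rowSums(X1^2), rowSums(X2^2), "+") - 2 * X1 %*% t(X2)
  exp(- d2 / (2 * sigma^2))
}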

Question 1 [45 Points] Kernel Ridge Regression

Let’s first generate a set of data using the following code (or some similar code in Python):

set.seed(1)
n = 500
p = 2
X = matrix(rnorm(n*p, 0, 1), n, p)
y = 2*sin(X[, 1]*2) + 2*atan(X[, 2]*4) + rnorm(n, 0, 0.5)

alldata = cbind(X, y)
train = alldata[1:200, ]
test = alldata[201:500, ]

[5 Points] As a comparison, we can first fit a ridge regression to this data. Use the train part to fit a ridge regression using cv.glmnet() from the glmnet package with 10-fold cross-validation. Use lambda.min as your tuning parameter to predict the testing data. What is the prediction error?

To fit a kernel ridge regression, we use the following formulation:

\[\underset{\boldsymbol \alpha}{\text{minimize}} \quad \frac{1}{n} \big\lVert \mathbf{y} -\mathbf{K} \boldsymbol \alpha \big\rVert^2 + \lambda \boldsymbol \alpha^\text{T} \mathbf{K} \boldsymbol \alpha.\]

It should be noted that you could add an \(\alpha_0\) term to this regression to rigorously handle the intercept. However, since the theoretical mean of \(Y\) is zero, the intercept is not a crucial issue and is not required here. Following our derivation in class, perform the following steps to complete this kernel ridge regression:

Answer:

Ridge Regression

library(glmnet)
# ridge regression uses `alpha = 0`; cv.glmnet() performs 10-fold CV by default
ridge.fit = cv.glmnet(x = train[, 1:2], y = train[, 3], alpha = 0, nfolds = 10)
mean((predict(ridge.fit, test[, 1:2], s = "lambda.min") - test[, 3])^2)
## [1] 3.205833
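
Kernel Ridge Regression

Setting the gradient of the kernel ridge objective to zero (a standard derivation, assuming \(\mathbf{K}\) is nonsingular) gives the closed-form solution

\[\widehat{\boldsymbol \alpha} = (\mathbf{K} + n \lambda \mathbf{I})^{-1} \mathbf{y},\]

and, by the Representer Theorem, the prediction at a new point \(\mathbf{x}\) is \(f(\mathbf{x}) = \sum_{j=1}^n \widehat{\alpha}_j K(\mathbf{x}_j, \mathbf{x})\). The code below computes exactly these two quantities with \(n = 200\) training points.
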
# radial basis kernel matrix on the training data
sigma = 1
K = exp(- 0.5* as.matrix(dist(train[, 1:2]))^2 / sigma^2)

# closed-form solution: alpha = (K + n * lambda * I)^{-1} y, with n = 200
lambda = 0.01
alpha = solve(K + 200*lambda*diag(200)) %*% train[, 3]

# kernel matrix between the 300 testing points and the 200 training points
Ktest = matrix(NA, 300, 200)
for (i in 1:300)
{
  Ktest[i, ] = exp( - rowSums((sweep(train[, 1:2], 2, test[i, 1:2], FUN = "-"))^2) / 2 / sigma^2)
}

# predictions at the testing points and the prediction error
ypred = Ktest %*% alpha
mean((test[, 3] - ypred)^2)
## [1] 0.7995524

The prediction error of kernel ridge regression is about 0.80, which is much smaller than the 3.21 of linear ridge regression. This is because the signal is non-linear in both covariates, so a linear model cannot capture it, while the radial basis kernel can.

Question 2 [55 Points] Non-linear SVM as Penalized Version

Recall that in HW8, we solved SVM using a penalized loss framework, with the logistic loss function: \[L(y, f(x)) = \log(1 + e^{- y f(x)}).\]

Again, we can specify the form of \(f(\cdot)\) as an element of an RKHS, and the penalized objective function becomes

\[\frac{1}{n} \sum_{i=1}^n L(y_i, K_i^\text{T} \boldsymbol \alpha) + \lambda \boldsymbol \alpha^\text{T} \mathbf{K} \boldsymbol \alpha,\] where \(\mathbf{K}\) is the same \(n \times n\) kernel matrix as before and \(K_i\) is the \(i\)th column of \(\mathbf{K}\). The data are generated by the following code.

set.seed(1)
n = 500
p = 2
X = matrix(runif(n*p, -2, 2), n, p)
y = sign( X[, 2] - sin(pi*X[, 1]))
alldata = cbind(X, y)
train = alldata[1:200, ]
test = alldata[201:500, ]

# visualize the data 
plot(train[, 1], train[, 2], col = ifelse(train[, 3] > 0, "red", "blue"),
     pch = 19)
lines(seq(-3, 3, 0.01), sin(pi*seq(-3, 3, 0.01)) )

Similar to the HW8 linear SVM penalized version, we perform the following steps to complete this model fitting:

Note: 1) In a real problem, you can perform cross-validation to find the best tuning parameters; a small sketch of this is given at the end of this answer. 2) You may also add a constant term \(\alpha_0\) to the problem to take care of the intercept. These two are not required for the HW.

Answer:

The calculation of the gradient is similar to HW8. Two differences are:

- the fitted value is now \(f(x_i) = K_i^\text{T} \boldsymbol \alpha\), so \(\frac{\partial f}{\partial \boldsymbol \alpha} = K_i\);
- the penalty term is \(\lambda \boldsymbol \alpha^\text{T} \mathbf{K} \boldsymbol \alpha\), whose gradient is \(2 \lambda \mathbf{K} \boldsymbol \alpha\).

Substituting \(f(x_i)\) into each \(\frac{\partial L_i}{\partial \boldsymbol \alpha}\), the gradient of the objective function is

\[\frac{1}{n} \sum_{i = 1}^n \frac{\partial L_i}{\partial \boldsymbol \alpha} + 2 \lambda \mathbf{K} \boldsymbol \alpha,\] where

\[\frac{\partial L_i}{\partial \boldsymbol \alpha} = \frac{\partial L_i}{\partial f} \frac{\partial f}{\partial \boldsymbol \alpha} = \frac{-y_i \exp[-y_i f(x_i)]}{1 + \exp[-y_i f(x_i)]} K_i = \frac{-y_i K_i}{1 + \exp[y_i f(x_i)]}.\] Some other equivalent forms are also acceptable (for example, the penalty term may be scaled by a constant, which only rescales \(\lambda\)).

# define the objective function
svmfn = function(b, K, y, lambda){
  mean(log(1 + exp(- y * (K %*% b) ))) + lambda * t(b) %*% K %*% b
}

# define the gradient function
svmgn = function(b, K, y, lambda){
  a1 = exp(y * (K %*% b))    # exp(y_i * f(x_i))
  a2 = -y / (1 + a1)         # -y_i / (1 + exp(y_i * f(x_i)))
  # average of dL_i/dalpha plus the penalty gradient 2 * lambda * K %*% alpha
  return(colMeans(sweep(K, 1, a2, FUN = "*")) + K %*% b * lambda * 2)
}
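
As an optional sanity check (not required for the HW), the analytic gradient can be compared with a central finite-difference approximation of svmfn on a small random problem; the test point, kernel, and problem size below are arbitrary illustrative choices.

# optional check: analytic gradient vs. central finite differences (illustrative only)
set.seed(2)
Kchk = exp(- 0.5 * as.matrix(dist(matrix(rnorm(20), 10, 2)))^2)  # small 10 x 10 kernel, sigma = 1
ychk = sample(c(-1, 1), 10, replace = TRUE)
bchk = rnorm(10, 0, 0.1)
eps = 1e-6
num.grad = sapply(1:10, function(j){
  e = rep(0, 10); e[j] = eps
  (svmfn(bchk + e, Kchk, ychk, 0.01) - svmfn(bchk - e, Kchk, ychk, 0.01)) / (2 * eps)
})
max(abs(num.grad - svmgn(bchk, Kchk, ychk, 0.01)))  # should be close to zero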

# calculate the kernel matrix
sigma = 0.2
K = exp(- 0.5* as.matrix(dist(train[, 1:2]))^2 / sigma^2)

# perform optimization (the analytic gradient gr = svmgn could also be supplied)
lambda = 0.01
mysvmfit = optim(par = rep(0, 200), fn = svmfn, #gr = svmgn,
                 K = K, y = train[, 3], lambda = lambda, 
                 method = "BFGS")

# Loss value
print(mysvmfit$value)
## [1] 0.6036093
# training data prediction
yTrainPred = ifelse(K %*% mysvmfit$par > 0, 1, -1)
# training err
mean(yTrainPred != train[, 3])
## [1] 0.01
# confusion table
confuse_table = table(yTrainPred, train[, 3])
rownames(confuse_table) = paste0("Pred ", rownames(confuse_table))
knitr::kable(confuse_table)
|        | -1|   1|
|:-------|--:|---:|
|Pred -1 | 82|   2|
|Pred 1  |  0| 116|
# training data plot
plot(train[, 1], train[, 2], pch = 19, xlab = "x1", ylab = "x2", 
     col = ifelse(yTrainPred == 1, "darkorange", "deepskyblue"))

legend("topleft", c("Positive","Negative"), 
       col=c("darkorange", "deepskyblue"), pch=c(19, 19))

# use fitted coefficients to predict testing data 
Ktest = matrix(NA, 300, 200)
for (i in 1:300){
  Ktest[i, ] = exp( - rowSums((sweep(train[, 1:2], 2, test[i, 1:2], FUN = "-"))^2) /
                      2 / sigma^2)
}
  
yTestPred = ifelse(Ktest %*% mysvmfit$par >0, 1, -1)
# testing error
mean(yTestPred != test[, 3])
## [1] 0.08666667
plot(test[, 1], test[, 2], pch = 19, xlab = "x1", ylab = "x2", 
     col = ifelse(yTestPred == 1,"darkorange", "deepskyblue"))

legend("topleft", c("Positive","Negative"), 
       col=c("darkorange", "deepskyblue"), pch=c(19, 19))

lines(seq(-3, 3, 0.01), sin(pi*seq(-3, 3, 0.01)) )
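
As noted before the answer, in a real application \(\lambda\) (and \(\sigma\)) would be chosen by cross-validation rather than fixed in advance. Below is a minimal sketch of a 5-fold cross-validation over a small \(\lambda\) grid, reusing svmfn, svmgn, and the training kernel matrix K from above; the number of folds, the fold assignment, and the grid values are illustrative choices, not part of the assignment.

# illustrative 5-fold CV over lambda; folds and grid are arbitrary choices
set.seed(3)
fold = sample(rep(1:5, length.out = 200))
lambda.grid = c(0.001, 0.01, 0.1)
cv.err = rep(NA, length(lambda.grid))

for (l in seq_along(lambda.grid))
{
  err = rep(NA, 5)
  for (k in 1:5)
  {
    id = which(fold == k)
    # fit on the remaining folds, using the corresponding sub-kernel
    fit = optim(par = rep(0, 200 - length(id)), fn = svmfn, gr = svmgn,
                K = K[-id, -id], y = train[-id, 3], lambda = lambda.grid[l],
                method = "BFGS")
    # predict the held-out fold using the cross-kernel block K[id, -id]
    pred = ifelse(K[id, -id] %*% fit$par > 0, 1, -1)
    err[k] = mean(pred != train[id, 3])
  }
  cv.err[l] = mean(err)
}

cbind(lambda = lambda.grid, cv.error = cv.err)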