Instructions

Students are encouraged to work together on homework. However, sharing, copying, or providing any part of a homework solution or code is an infraction of the University’s rules on Academic Integrity. Any violation will be punished as severely as possible. Final submissions must be uploaded to compass2g. No email or hardcopy will be accepted. For late submission policy and grading rubrics, please refer to the course website.

About HW9

In this HW, we use two examples to demonstrate how RKHS methods and the Representer Theorem are used in practice. In both questions, we will use the radial basis kernel, defined as

\[K(\mathbf{x}, \mathbf{z}) = e^{- \frac{\lVert \mathbf{x} - \mathbf{z}\rVert^2}{2\sigma^2}}.\] You are not required to tune the parameter \(\sigma\).
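
For concreteness, the small helper below computes this kernel matrix between two sets of observations; the name rbf_kernel() is an illustrative choice and not part of the assignment, and the solutions below compute the same quantity inline with dist() and sweep().

# radial basis kernel matrix between the rows of X1 (n1 x p) and X2 (n2 x p)
# rbf_kernel() is an illustrative helper, not required by the assignment
rbf_kernel = function(X1, X2, sigma = 1)
{
  # squared distances via ||x - z||^2 = ||x||^2 + ||z||^2 - 2 * x'z
  d2 = outer(rowSums(X1^2), rowSums(X2^2), "+") - 2 * X1 %*% t(X2)
  exp(- d2 / (2 * sigma^2))
}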

Question 1 [45 Points] Kernel Ridge Regression

Let’s first generate a set of data using the following code (or some similar code in Python):

set.seed(1)
n = 500
p = 2
X = matrix(rnorm(n*p, 0, 1), n, p)
y = 2*sin(X[, 1]*2) + 2*atan(X[, 2]*4) + rnorm(n, 0, 0.5)

alldata = cbind(X, y)
train = alldata[1:200, ]
test = alldata[201:500, ]

[5 Points] As a comparison, we can first fit a ridge regression to this data. Use the train part to fit a ridge regression using cv.glmnet() from the glmnet package with 10-fold cross-validation. Use lambda.min as your tuning parameter to predict the testing data. What is the prediction error?

To fit a kernel ridge regression, we use the following formulation:

\[\underset{\boldsymbol \alpha}{\text{minimize}} \quad \frac{1}{n} \big\lVert \mathbf{y} -\mathbf{K} \boldsymbol \alpha \big\rVert^2 + \lambda \boldsymbol \alpha^\text{T} \mathbf{K} \boldsymbol \alpha.\]

It should be noted that you could add an \(\alpha_0\) term to this regression to rigorously handle the intercept. However, since the theoretical mean of \(Y\) is zero, the intercept is not a crucial issue and is not required here. Following our derivation in class, perform the following steps to complete this kernel ridge regression:

Answer:

Ridge Regression

library(glmnet)
# ridge regression uses `alpha = 0`; cv.glmnet() performs 10-fold CV by default
ridge.fit = cv.glmnet(x = train[, 1:2], y = train[, 3], alpha = 0, nfolds = 10)
mean((predict(ridge.fit, test[, 1:2], s = "lambda.min") - test[, 3])^2)
## [1] 3.205833
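
Kernel Ridge Regression

Setting the gradient of the kernel ridge objective to zero (a standard derivation, assuming \(\mathbf{K}\) is nonsingular) gives the closed-form solution

\[\widehat{\boldsymbol \alpha} = (\mathbf{K} + n \lambda \mathbf{I})^{-1} \mathbf{y},\]

and, by the Representer Theorem, the prediction at a new point \(\mathbf{x}\) is \(f(\mathbf{x}) = \sum_{j=1}^n \widehat{\alpha}_j K(\mathbf{x}_j, \mathbf{x})\). The code below computes exactly these two quantities with \(n = 200\) training points.
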
# radial basis kernel matrix on the training data
sigma = 1
K = exp(- 0.5* as.matrix(dist(train[, 1:2]))^2 / sigma^2)

# closed-form solution: alpha = (K + n * lambda * I)^{-1} y, with n = 200
lambda = 0.01
alpha = solve(K + 200*lambda*diag(200)) %*% train[, 3]

# kernel matrix between the 300 testing points and the 200 training points
Ktest = matrix(NA, 300, 200)
for (i in 1:300)
{
  Ktest[i, ] = exp( - rowSums((sweep(train[, 1:2], 2, test[i, 1:2], FUN = "-"))^2) / 2 / sigma^2)
}

# predictions at the testing points and the prediction error
ypred = Ktest %*% alpha
mean((test[, 3] - ypred)^2)
## [1] 0.7995524

The prediction error of kernel ridge regression is about 0.80, which is much smaller than the 3.21 of linear ridge regression. This is because the signal is non-linear in both covariates, so a linear model cannot capture it, while the radial basis kernel can.

Question 2 [55 Points] Non-linear SVM as Penalized Version

Recall that in HW8, we solved SVM using a penalized loss framework, with the logistic loss function: \[L(y, f(x)) = \log(1 + e^{- y f(x)}).\]

Again, we can specify the form of \(f(\cdot)\) as an element of an RKHS, and the penalized objective function becomes

\[\frac{1}{n} \sum_{i=1}^n L(y_i, K_i^\text{T} \boldsymbol \alpha) + \lambda \boldsymbol \alpha^\text{T} \mathbf{K} \boldsymbol \alpha,\] where \(\mathbf{K}\) is the same \(n \times n\) kernel matrix as before and \(K_i\) is the \(i\)th column of \(\mathbf{K}\). The data are generated by the following code.

set.seed(1)
n = 500
p = 2
X = matrix(runif(n*p, -2, 2), n, p)
y = sign( X[, 2] - sin(pi*X[, 1]))
alldata = cbind(X, y)
train = alldata[1:200, ]
test = alldata[201:500, ]

# visualize the data 
plot(train[, 1], train[, 2], col = ifelse(train[, 3] > 0, "red", "blue"),
     pch = 19)
lines(seq(-3, 3, 0.01), sin(pi*seq(-3, 3, 0.01)) )

Similar to the HW8 linear SVM penalized version, we perform the following steps to complete this model fitting:

Note: 1) In a real problem, you can perform cross-validation to find the best tuning parameters; a small sketch of this is given at the end of this answer. 2) You may also add a constant term \(\alpha_0\) to the problem to take care of the intercept. These two are not required for the HW.

Answer:

The calculation of the gradient is similar to HW8. Two differences are:

- the fitted value is now \(f(x_i) = K_i^\text{T} \boldsymbol \alpha\), so \(\frac{\partial f}{\partial \boldsymbol \alpha} = K_i\);
- the penalty term is \(\lambda \boldsymbol \alpha^\text{T} \mathbf{K} \boldsymbol \alpha\), whose gradient is \(2 \lambda \mathbf{K} \boldsymbol \alpha\).

Substituting \(f(x_i)\) into each \(\frac{\partial L_i}{\partial \boldsymbol \alpha}\), the gradient of the objective function is

\[\frac{1}{n} \sum_{i = 1}^n \frac{\partial L_i}{\partial \boldsymbol \alpha} + 2 \lambda \mathbf{K} \boldsymbol \alpha,\] where

\[\frac{\partial L_i}{\partial \boldsymbol \alpha} = \frac{\partial L_i}{\partial f} \frac{\partial f}{\partial \boldsymbol \alpha} = \frac{-y_i \exp[-y_i f(x_i)]}{1 + \exp[-y_i f(x_i)]} K_i = \frac{-y_i K_i}{1 + \exp[y_i f(x_i)]}.\] Some other equivalent forms are also acceptable (for example, the penalty term may be scaled by a constant, which only rescales \(\lambda\)).

# define the objective function
svmfn = function(b, K, y, lambda){
  mean(log(1 + exp(- y * (K %*% b) ))) + lambda * t(b) %*% K %*% b
}

# define the gradient function
svmgn = function(b, K, y, lambda){
  a1 = exp(y * (K %*% b))    # exp(y_i * f(x_i))
  a2 = -y / (1 + a1)         # -y_i / (1 + exp(y_i * f(x_i)))
  # average of dL_i/dalpha plus the penalty gradient 2 * lambda * K %*% alpha
  return(colMeans(sweep(K, 1, a2, FUN = "*")) + K %*% b * lambda * 2)
}
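
As an optional sanity check (not required for the HW), the analytic gradient can be compared with a central finite-difference approximation of svmfn on a small random problem; the test point, kernel, and problem size below are arbitrary illustrative choices.

# optional check: analytic gradient vs. central finite differences (illustrative only)
set.seed(2)
Kchk = exp(- 0.5 * as.matrix(dist(matrix(rnorm(20), 10, 2)))^2)  # small 10 x 10 kernel, sigma = 1
ychk = sample(c(-1, 1), 10, replace = TRUE)
bchk = rnorm(10, 0, 0.1)
eps = 1e-6
num.grad = sapply(1:10, function(j){
  e = rep(0, 10); e[j] = eps
  (svmfn(bchk + e, Kchk, ychk, 0.01) - svmfn(bchk - e, Kchk, ychk, 0.01)) / (2 * eps)
})
max(abs(num.grad - svmgn(bchk, Kchk, ychk, 0.01)))  # should be close to zero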

# calculate the kernel matrix
sigma = 0.2
K = exp(- 0.5* as.matrix(dist(train[, 1:2]))^2 / sigma^2)

# perform optimization (the analytic gradient gr = svmgn could also be supplied)
lambda = 0.01
mysvmfit = optim(par = rep(0, 200), fn = svmfn, #gr = svmgn,
                 K = K, y = train[, 3], lambda = lambda, 
                 method = "BFGS")

# Loss value
print(mysvmfit$value)
## [1] 0.6036093
# training data prediction
yTrainPred = ifelse(K %*% mysvmfit$par > 0, 1, -1)
# training err
mean(yTrainPred != train[, 3])
## [1] 0.01
# confusion table
confuse_table = table(yTrainPred, train[, 3])
rownames(confuse_table) = paste0("Pred ", rownames(confuse_table))
knitr::kable(confuse_table)
|        | -1|   1|
|:-------|--:|---:|
|Pred -1 | 82|   2|
|Pred 1  |  0| 116|
# training data plot
plot(train[, 1], train[, 2], pch = 19, xlab = "x1", ylab = "x2", 
     col = ifelse(yTrainPred == 1, "darkorange", "deepskyblue"))

legend("topleft", c("Positive","Negative"), 
       col=c("darkorange", "deepskyblue"), pch=c(19, 19))

# use fitted coefficients to predict testing data 
Ktest = matrix(NA, 300, 200)
for (i in 1:300){
  Ktest[i, ] = exp( - rowSums((sweep(train[, 1:2], 2, test[i, 1:2], FUN = "-"))^2) /
                      2 / sigma^2)
}
  
yTestPred = ifelse(Ktest %*% mysvmfit$par >0, 1, -1)
# testing error
mean(yTestPred != test[, 3])
## [1] 0.08666667
plot(test[, 1], test[, 2], pch = 19, xlab = "x1", ylab = "x2", 
     col = ifelse(yTestPred == 1,"darkorange", "deepskyblue"))

legend("topleft", c("Positive","Negative"), 
       col=c("darkorange", "deepskyblue"), pch=c(19, 19))

lines(seq(-3, 3, 0.01), sin(pi*seq(-3, 3, 0.01)) )
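
As noted before the answer, in a real application \(\lambda\) (and \(\sigma\)) would be chosen by cross-validation rather than fixed in advance. Below is a minimal sketch of a 5-fold cross-validation over a small \(\lambda\) grid, reusing svmfn, svmgn, and the training kernel matrix K from above; the number of folds, the fold assignment, and the grid values are illustrative choices, not part of the assignment.

# illustrative 5-fold CV over lambda; folds and grid are arbitrary choices
set.seed(3)
fold = sample(rep(1:5, length.out = 200))
lambda.grid = c(0.001, 0.01, 0.1)
cv.err = rep(NA, length(lambda.grid))

for (l in seq_along(lambda.grid))
{
  err = rep(NA, 5)
  for (k in 1:5)
  {
    id = which(fold == k)
    # fit on the remaining folds, using the corresponding sub-kernel
    fit = optim(par = rep(0, 200 - length(id)), fn = svmfn, gr = svmgn,
                K = K[-id, -id], y = train[-id, 3], lambda = lambda.grid[l],
                method = "BFGS")
    # predict the held-out fold using the cross-kernel block K[id, -id]
    pred = ifelse(K[id, -id] %*% fit$par > 0, 1, -1)
    err[k] = mean(pred != train[id, 3])
  }
  cv.err[l] = mean(err)
}

cbind(lambda = lambda.grid, cv.error = cv.err)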