Students are encouraged to work together on homework. However, sharing, copying, or providing any part of a homework solution or code is an infraction of the University’s rules on Academic Integrity. Any violation will be punished as severely as possible. Final submissions must be uploaded to compass2g. No email or hardcopy will be accepted. For late submission policy and grading rubrics, please refer to the course website.
What is expected for the submission to Gradescope

- Submit your final file named HWx_yourNetID.pdf. For example, HW01_rqzhu.pdf. Please note that this must be a .pdf file generated by a .Rmd file; .html format cannot be accepted.
- Please note that your homework file is a PDF report instead of a messy collection of R codes. This report should include:
  - Your name and NetID at the top (replace Ruoqing Zhu (rqzhu) by your name and NetID if you are using this template).
  - All R code chunks visible for grading.
  - Your answers to the questions, supported by the R code chunks, e.g., "Answer: I fit SVM with the following choice of tuning parameters ..."
- Requirements regarding the .Rmd file: we do not require the submission of .Rmd files. However, your PDF file should be rendered directly from it.

In this HW, we use two examples to demonstrate how the RKHS and the Representer Theorem are used in practice. In both questions, we will use the radial basis kernel, defined as
\[K(\mathbf{x}, \mathbf{z}) = e^{- \frac{\lVert \mathbf{x} - \mathbf{z}\rVert^2}{2\sigma^2}}.\] You are not required to tune the parameter \(\sigma\).
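For concreteness, this kernel can be computed directly in R; a minimal sketch (the function name rbf_kernel is only for illustration and is not used in the rest of the HW):
# radial basis kernel between two vectors x and z
rbf_kernel = function(x, z, sigma = 1){
  exp(- sum((x - z)^2) / (2 * sigma^2))
}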
Let’s first generate a set of data using the following code (or some similar code in Python):
set.seed(1)
n = 500
p = 2
X = matrix(rnorm(n*p, 0, 1), n, p)
y = 2*sin(X[, 1]*2) + 2*atan(X[, 2]*4) + rnorm(n, 0, 0.5)
alldata = cbind(X, y)
train = alldata[1:200, ]
test = alldata[201:500, ]
[5 Points] As a comparison, we can first fit a ridge regression to this data. Use the train part to fit a ridge regression with 10-fold cross-validation, using the cv.glmnet() function from the glmnet package. Use lambda.min as your tuning parameter to predict the testing data. What is the prediction error?
To fit a kernel ridge regression, we use the following formulation:
\[\underset{\boldsymbol \alpha}{\text{minimize}} \quad \frac{1}{n} \big\lVert \mathbf{y} -\mathbf{K} \boldsymbol \alpha \big\rVert^2 + \lambda \boldsymbol \alpha^\text{T} \mathbf{K} \boldsymbol \alpha\]
It should be noted that you could add an \(\alpha_0\) term to this regression to rigorously account for the intercept. However, the theoretical mean of \(Y\) is zero, so this is not a crucial issue and is not required here. Following our derivation in class, perform the following to complete this kernel ridge regression:
Answer:
Ridge Regression
library(glmnet)
# ridge regression uses `alpha = 0`
ridge.fit = cv.glmnet(x = train[, 1:2], y = train[, 3], alpha = 0)
mean((predict(ridge.fit, test[, 1:2], s = "lambda.min") - test[, 3])^2)
## [1] 3.205833
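Kernel Ridge Regression

To see where the closed-form solution in the code below comes from: setting the gradient of the kernel ridge objective with respect to \(\boldsymbol \alpha\) to zero gives
\[-\frac{2}{n} \mathbf{K} (\mathbf{y} - \mathbf{K} \boldsymbol \alpha) + 2 \lambda \mathbf{K} \boldsymbol \alpha = \mathbf{0},\]
which is satisfied by
\[\widehat{\boldsymbol \alpha} = (\mathbf{K} + n \lambda \mathbf{I})^{-1} \mathbf{y}.\]
This is what the code implements, with \(n = 200\) training observations.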
# bandwidth parameter of the radial basis kernel
sigma = 1
# n x n kernel matrix of the training data
K = exp(- 0.5* as.matrix(dist(train[, 1:2]))^2 / sigma^2)
# tuning parameter
lambda = 0.01
# closed-form solution: alpha = (K + n * lambda * I)^{-1} y, with n = 200
alpha = solve(K + 200*lambda*diag(200)) %*% train[, 3]
# kernel matrix between the 300 testing and 200 training observations
Ktest = matrix(NA, 300, 200)
for (i in 1:300)
{
  Ktest[i, ] = exp( - rowSums((sweep(train[, 1:2], 2, test[i, 1:2], FUN = "-"))^2) / 2 / sigma^2)
}
# predict the testing data and report the prediction error
ypred = Ktest %*% alpha
mean((test[, 3] - ypred)^2)
## [1] 0.7995524
The prediction error of kernel ridge regression is 0.80, which is much smaller than that of linear ridge regression (3.21). This is because the underlying signal is nonlinear, which the linear ridge model cannot capture but the radial basis kernel can.
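As a side remark (not required for the HW), the testing kernel matrix can also be computed without the explicit loop, using the identity \(\lVert x - z \rVert^2 = \lVert x \rVert^2 + \lVert z \rVert^2 - 2 x^\text{T} z\). A base-R sketch using the same train, test, and sigma objects as above (the names D2 and Ktest2 are only for illustration):
# squared distances between every testing point and every training point
D2 = outer(rowSums(test[, 1:2]^2), rowSums(train[, 1:2]^2), "+") -
     2 * test[, 1:2] %*% t(train[, 1:2])
Ktest2 = exp(- D2 / (2 * sigma^2))
# Ktest2 should match the Ktest computed in the loop above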
Recall that in HW8, we solved SVM using a penalized loss framework, with the logistic loss function: \[L(y, f(x)) = \log(1 + e^{- y f(x)}).\]
Again, we can specify the form of \(f(\cdot)\) as an element of an RKHS, and the penalized objective function becomes
\[\frac{1}{n} \sum_{i=1}^n L(y_i, K_i^\text{T} \boldsymbol \alpha) + \lambda \boldsymbol \alpha^\text{T} \mathbf{K} \boldsymbol \alpha,\] where \(\mathbf{K}\) is the same \(n \times n\) kernel matrix as before and \(K_i\) is the \(i\)th column of \(\mathbf{K}\). The data are generated by the following code.
set.seed(1)
n = 500
p = 2
X = matrix(runif(n*p, -2, 2), n, p)
y = sign( X[, 2] - sin(pi*X[, 1]))
alldata = cbind(X, y)
train = alldata[1:200, ]
test = alldata[201:500, ]
# visualize the data
plot(train[, 1], train[, 2], col = ifelse(train[, 3] > 0, "red", "blue"),
     pch = 19)
lines(seq(-3, 3, 0.01), sin(pi*seq(-3, 3, 0.01)) )
Similar to the HW8 linear SVM penalized version, we perform the following steps to complete this model fitting:

- Write an objective function SVMfn(b, K, y, lambda) corresponding to the form we specified. Make sure to use the \(1/n\) scale in the loss function part.
- Derive the gradient of the objective function and write a function SVMgr(b, K, y, lambda) to implement it.
- Use the optim() function to solve this optimization problem by supplying the objective function and gradient function. Use \(\sigma = 0.2\) and \(\lambda = 0.01\). Initialize your coefficients as zeros. Report the following:
Note: 1) In a real problem, you can perform cross-validation to find the best tuning parameters. 2) You may also add a constant term \(\alpha_0\) to the problem to take care of the intercept. These two are not required for the HW.
Answer:
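For reference, by the Representer Theorem, the fitted function takes the form
\[\widehat{f}(x) = \sum_{j=1}^n \widehat{\alpha}_j K(x, x_j),\]
where the \(x_j\) are the training points. This is why, in the code below, training predictions are computed as K %*% mysvmfit$par and testing predictions as Ktest %*% mysvmfit$par.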
The calculation of the gradient is similar to HW8. Two differences are: 1) the fitted function is now \(f(x_i) = K_i^\text{T} \boldsymbol \alpha\), so \(K_i\) plays the role that \(x_i\) played in the linear version; 2) the penalty term is \(\lambda \boldsymbol \alpha^\text{T} \mathbf{K} \boldsymbol \alpha\), whose gradient is \(2 \lambda \mathbf{K} \boldsymbol \alpha\). Replacing \(f(x_i)\) in each \(\frac{\partial L_i}{\partial \boldsymbol \alpha}\), the gradient of the objective function becomes
\[ \frac{1}{n} \bigg[\sum_{i = 1}^n\frac{\partial L_i}{\partial \boldsymbol \alpha}\bigg] + 2 \lambda \mathbf{K} \boldsymbol \alpha, \] where
\[\frac{\partial L_i}{\partial \boldsymbol \alpha} = \frac{\partial L_i}{\partial f} \frac{\partial f}{\partial \boldsymbol \alpha} = \frac{-y_i \exp[-y_i f(x_i)]}{1 + \exp[-y_i f(x_i)]} K_i = \frac{-y_i K_i}{1 + \exp[y_i f(x_i)]}. \] Some other similar forms are also acceptable (for example, the penalty term may carry a different constant factor, which can be absorbed into \(\lambda\)).
# define the objective function
svmfn = function(b, K, y, lambda){
mean(log(1 + exp(- y * (K %*% b) ))) + lambda * t(b) %*% K %*% b
}
# define the gradient function
svmgn = function(b, K, y, lambda){
  a1 = exp(y * (K %*% b))
  a2 = -y / (1 + a1)   # dL_i/df evaluated at f = K %*% b
  # loss gradient (1/n) sum_i a2_i * K_i plus penalty gradient 2 * lambda * K %*% b
  return(colMeans(sweep(K, 1, a2, FUN = "*")) + K %*% b * lambda * 2)
}
# calculate the kernel matrix
sigma = 0.2
K = exp(- 0.5* as.matrix(dist(train[, 1:2]))^2 / sigma^2)
# perform optimization
lambda = 0.01
mysvmfit = optim(par = rep(0, 200), fn = svmfn, #gr = svmgn,
K = K, y = train[, 3], lambda = lambda,
method = "BFGS")
# Loss value
print(mysvmfit$value)
## [1] 0.6036093
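As an optional sanity check (not part of the required report), the analytic gradient in svmgn can be compared against a central finite-difference approximation of svmfn. A short sketch using the objects defined above; the helper objects b0, eps, and num_grad are only for this check, and the gradient is evaluated at an arbitrary non-optimal point:
# finite-difference check of the first three gradient coordinates
b0 = rep(0.01, 200)
eps = 1e-6
num_grad = sapply(1:3, function(j){
  e = rep(0, 200); e[j] = eps
  (svmfn(b0 + e, K, train[, 3], lambda) - svmfn(b0 - e, K, train[, 3], lambda)) / (2 * eps)
})
# the two columns should agree closely
cbind(numerical = num_grad, analytic = svmgn(b0, K, train[, 3], lambda)[1:3])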
# training data prediction
yTrainPred = ifelse(K %*% mysvmfit$par > 0, 1, -1)
# training err
mean(yTrainPred != train[, 3])
## [1] 0.01
# confusion table
confuse_table = table(yTrainPred, train[, 3])
rownames(confuse_table) = paste0("Pred ", rownames(confuse_table))
knitr::kable(confuse_table)
|         | -1 | 1   |
|---------|----|-----|
| Pred -1 | 82 | 2   |
| Pred 1  | 0  | 116 |
# training data plot
plot(train[, 1], train[, 2], pch = 19, xlab = "x1", ylab = "x2",
col = ifelse(yTrainPred == 1, "darkorange", "deepskyblue"))
legend("topleft", c("Positive","Negative"),
col=c("darkorange", "deepskyblue"), pch=c(19, 19))
# use fitted coefficients to predict testing data
Ktest = matrix(NA, 300, 200)
for (i in 1:300){
Ktest[i, ] = exp( - rowSums((sweep(train[, 1:2], 2, test[i, 1:2], FUN = "-"))^2) /
2 / sigma^2)
}
yTestPred = ifelse(Ktest %*% mysvmfit$par >0, 1, -1)
# testing error
mean(yTestPred != test[, 3])
## [1] 0.08666667
plot(test[, 1], test[, 2], pch = 19, xlab = "x1", ylab = "x2",
col = ifelse(yTestPred == 1,"darkorange", "deepskyblue"))
legend("topleft", c("Positive","Negative"),
col=c("darkorange", "deepskyblue"), pch=c(19, 19))
lines(seq(-3, 3, 0.01), sin(pi*seq(-3, 3, 0.01)) )
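As mentioned in the note of this question, in a real problem the tuning parameters \(\sigma\) and \(\lambda\) would be chosen by cross-validation rather than fixed. The following is only a rough sketch of how a 5-fold cross-validation error could be computed for one candidate pair on the training data, reusing the svmfn objective defined above (the function cv_err, the fold construction, and the candidate values are illustrative and not part of the HW):
# cross-validation error of the kernel SVM for a given (sigma, lambda)
cv_err = function(sigma, lambda, nfold = 5){
  folds = sample(rep(1:nfold, length.out = nrow(train)))
  errs = rep(NA, nfold)
  for (k in 1:nfold){
    tr = train[folds != k, ]
    va = train[folds == k, ]
    # kernel matrix on the current training fold
    Ktr = exp(- as.matrix(dist(tr[, 1:2]))^2 / (2 * sigma^2))
    fit = optim(par = rep(0, nrow(tr)), fn = svmfn,
                K = Ktr, y = tr[, 3], lambda = lambda, method = "BFGS")
    # kernel matrix between the validation fold and the training fold
    Kva = matrix(NA, nrow(va), nrow(tr))
    for (i in 1:nrow(va)){
      Kva[i, ] = exp(- rowSums((sweep(tr[, 1:2], 2, va[i, 1:2], FUN = "-"))^2) / (2 * sigma^2))
    }
    errs[k] = mean(ifelse(Kva %*% fit$par > 0, 1, -1) != va[, 3])
  }
  mean(errs)
}
# e.g., compare cv_err(0.2, 0.01) against a few other candidate values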