Students are encouraged to work together on homework. However, sharing, copying, or providing any part of a homework solution or code is an infraction of the University’s rules on Academic Integrity. Any violation will be punished as severely as possible. Final submissions must be uploaded to compass2g. No email or hardcopy will be accepted. For late submission policy and grading rubrics, please refer to the course website.
What is expected for the submission to Gradescope

- A PDF file named HWx_yourNetID.pdf. For example, HW01_rqzhu.pdf. Please note that this must be a .pdf file generated by a .Rmd file; .html format cannot be accepted.

Please note that your homework file is a PDF report instead of a messy collection of R code. This report should include:

- Your name and NetID, e.g., Ruoqing Zhu (rqzhu) (replace them by your name and NetID if you are using this template).
- All R code chunks visible for grading.
- R code chunks that support your answers.
- A clear answer to each question, e.g., "Answer: I fit SVM with the following choice of tuning parameters ..."

Requirements regarding the .Rmd file:

- You may modify the provided .Rmd files; however, your PDF file should be rendered directly from them.

We utilize the coordinate descent algorithm introduced in class to implement the entire Lasso solution. For coordinate descent, you may also want to review HW4. This HW involves two steps: in the first step, we solve the solution for a fixed \(\lambda\) value, while in the second step, we consider a sequence of \(\lambda\) values and solve it using the path-wise coordinate descent.
For this question, you cannot use functions from any additional library, except the MASS package, which is used to generate multivariate normal data. Following HW4, we use this version of the objective function:

\[\arg\min_{\beta} \frac{1}{n} \lVert \mathbf{y} - \mathbf{X}\beta \rVert^2 + \lambda \lVert \beta \rVert_1\]

The following data is used to fit this model. You can consider using similar functions in Python if needed. We use the following code to generate and standardize the data:
library(MASS)
set.seed(10)
n = 100
p = 200
# generate data
V = matrix(0.3, p, p)
diag(V) = 1
X_org = as.matrix(mvrnorm(n, mu = rep(0, p), Sigma = V))
true_b = c(1:3, -3:-1, rep(0, p-6))
y_org = X_org %*% true_b + rnorm(n)
# pre-scaling and centering X and y
X = scale(X_org)*sqrt(n/(n-1))
y = scale(y_org)*sqrt(n/(n-1))
lambda = 0.3
[10 pts] State the solution \(x\) of the following problem \[
\underset{x}{\arg \min} \,\, (x-b)^{2}+\lambda|x|, \quad \lambda>0
\] Then, implement a function in the form of soft_th <- function(b, lambda)
to return the result of the above problem. Note that in the coordinate descent algorithm discussed in the slides, \(b\) plays the role of a one-dimensional OLS estimator and \(\lambda\) is the penalty parameter. Report the function output for the following testing cases with \(\lambda = 0.3\): 1) \(b = 1\); 2) \(b = -1\); 3) \(b = -0.1\).
[40 pts] We will use the pre-scaled and centered data X and y for this question, hence no intercept is needed. Write a Lasso algorithm function myLasso(X, y, lambda, beta_init, tol, maxitr), which returns two outputs (as a list with two components): the number of iterations used and the estimated \(\boldsymbol \beta\) vector.
You need to consider the following while completing this question:

- Use the initial value beta_init: \(\boldsymbol \beta = \mathbf{0}_{p \times 1}\).
- Run at most maxitr = 100 iterations. Each iteration should loop through all variables.
- Terminate the algorithm based on tol. This means terminating the algorithm when the \(\boldsymbol \beta\) value of the current iteration is sufficiently similar to the previous one, i.e., \(\lVert \boldsymbol \beta^{(k)} - \boldsymbol \beta^{(k-1)} \rVert^2 \leq \text{tol}\).

After completing your code, run it on the data we generated previously. Provide the following results:
- The number of iterations used and the first 8 coefficients of your estimated \(\boldsymbol \beta\).
- A comparison with the glmnet package using the following code. You should report their first 8 coefficients and the \(L_1\) norm of the difference \(\| \, \hat{\boldsymbol\beta}^\text{glmnet}_{[1:8]} - \hat{\boldsymbol\beta}^\text{yours}_{[1:8]} \, \|_1\).

library(glmnet)
# glmnet uses a different loss function, so use lambda / 2 as the penalty
glmnetfit = glmnet(X, y, lambda = lambda / 2, intercept = FALSE)
glmnetfit$beta[1:8]
## [1] 0.000000000 0.158233309 0.429648259 -0.519431173 -0.171467909 -0.006652507
## [7] 0.000000000 0.000000000
Answer:
The solution is the soft-thresholding function: it shrinks the magnitude of \(b\) by \(\lambda/2\) and sets the result to zero whenever \(|b| \leq \lambda/2\).
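Concretely, setting the subgradient of \((x - b)^2 + \lambda |x|\) to zero gives the closed-form minimizer below, which is what the soft_th function implements:

\[
x^\ast = \text{sign}(b)\left(|b| - \frac{\lambda}{2}\right)_+ =
\begin{cases}
b - \lambda/2, & b > \lambda/2, \\
0, & |b| \leq \lambda/2, \\
b + \lambda/2, & b < -\lambda/2.
\end{cases}
\]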
soft_th <- function(b, lambda){
  # shrink the magnitude of b by lambda / 2 and truncate at zero
  return(sign(b) * max(0, abs(b) - lambda / 2))
}
b = c(1, -1, -0.1)
x = b
for (i in 1:3){
  x[i] = soft_th(b[i], lambda)
}
print(x)
## [1]  0.85 -0.85  0.00
The soft-thresholding results are 0.85, -0.85, and 0.
We now write our own Lasso algorithm.
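With the columns of X standardized so that \(\mathbf{x}_j^\top \mathbf{x}_j / n = 1\), each coordinate update in the code below is exactly the one-dimensional soft-thresholding problem above applied to the partial residual:

\[
\beta_j \leftarrow \text{sign}(b_j)\left(|b_j| - \frac{\lambda}{2}\right)_+, \qquad b_j = \frac{\mathbf{x}_j^\top \mathbf{r}^{(j)}}{n}, \qquad \mathbf{r}^{(j)} = \mathbf{y} - \sum_{k \neq j} \mathbf{x}_k \beta_k.
\]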
myLasso <- function(X, y, lambda, beta_init, tol = 1e-5, maxitr = 100)
{
  n = nrow(X)
  p = ncol(X)
  beta_new = beta_old = beta_init

  # the grand loop over iterations
  for (k in 1:maxitr){
    # save beta_old to check convergence later
    beta_old = beta_new

    # the current residual
    r = y - X %*% beta_old

    # update each beta_j in a loop
    for (j in 1:p){
      # add back the contribution of beta_j to obtain the partial residual for coordinate j
      r = r + X[, j] * beta_new[j]

      # apply the soft-thresholding function to update beta_j
      beta_new[j] <- soft_th(X[, j] %*% r / n, lambda)

      # remove the effect of the updated beta_j from the residual
      r = r - X[, j] * beta_new[j]
    }

    # after updating all coordinates, check the change of parameters
    if (sum((beta_new - beta_old)^2) < tol) break;
  }

  return(list("itr" = k, "beta" = beta_new))
}
# fit the model
myfit = myLasso(X, y, beta_init = rep(0, p), lambda = 0.3)
# the number of iterations and the value of parameters
cat(" Number of iter: ", myfit$itr, "\n",
"8 Coefs: ", myfit$beta[1:8])
## Number of iter: 6
## 8 Coefs: 0 0.158242 0.4295867 -0.5192701 -0.1712694 -0.006770245 0 0
We check the \(L_1\) norm of the difference between our solution and the glmnet solution:
sum(abs(myfit$beta[1:8] - glmnetfit$beta[1:8]))
## [1] 0.000547486
The number of iterations is 6, and the parameters are printed above. The \(L_1\) norm of the difference on the first 8 coefficients is 0.000547486. Our solution is very close to that of glmnet.
Let's perform path-wise coordinate descent. The idea is simple: we solve for the solution over a sequence of \(\lambda\) values, starting from the largest one in the sequence. The initial \(\boldsymbol\beta\) for the first (largest) \(\lambda\) is still all zeros. After obtaining the optimal \(\boldsymbol \beta\) for a given \(\lambda\), we simply use this solution as the initial value for the next, smaller \(\lambda\). This is referred to as a warm start in optimization problems. We will consider the following sequence of \(\lambda\) values borrowed from glmnet. Note that this is a decreasing sequence, from large to small values.
glmnetfit = glmnet(X, y, intercept = FALSE)
# Again, twice lambda is used for our function
lambda_all = glmnetfit$lambda * 2
# a matplot of the first 8 coefficients vs log scale of lambda
matplot(log(lambda_all), t(glmnetfit$beta[1:8, ]), type = "l", lwd = 2,
xlab = "Log Lambda",ylab = "Estimated Beta", main = "glmnet")
legend("topleft", paste("beta", 1:8, "=", c(1:3, -3:-1, 0, 0)),
col = 1:8, lty = 1:8, lwd = 2)
[20 pts] Write a function myLasso_pw <- function(X, y, lambda_all, tol, maxitr), which outputs a \(p \times N_{\lambda}\) matrix, where \(N_{\lambda}\) is the number of unique \(\lambda\) values. Also follow the instructions at the beginning of this question to include the warm start for the path-wise solution. Your myLasso_pw should make use of your myLasso function from Question 1.
[5 pts] Provide the same plot as the glmnet solution plot above for the first 8 parameters in your solution path. Make the two plots side-by-side (e.g., par(mfrow = c(1, 2)) in R), with glmnet on the left and your solution path on the right.
[5 pts] Based on your plot, if we decrease \(\lambda\) from its maximum value, which two variables enter (start to have nonzero values) the model first? You may denote your covariates as \(X_1, ..., X_8\).
[5 pts] In Question 1, we calculated the \(L_1\) norm discrepancy between our solution and glmnet on the first 8 parameters. In this question, we will calculate the discrepancy over all coefficients and over all \(\lambda\) values. After calculating the discrepancies, show a scatter plot of log(\(\lambda\)) vs. discrepancy. Comment on what you observe.
[15 pts] Based on the solution you obtained in the previous question, recover the unscaled coefficients using the formula in HW4. Then compare the first 9 coefficients (including the intercept term) with the following, using a plot similar to the one in b). Report the maximum value of the discrepancy (see d)) across all \(\lambda\) values.
glmnetfit2 = glmnet(X_org, y_org, lambda = lambda_all/2*sd(y_org)*sqrt(n/(n-1)))
lassobeta2 = coef(glmnetfit2)[1:9, ]
Answer:
We utilize the myLasso function from the previous question to implement the path-wise solution.
myLasso_pw <- function(X, y, lambda_all, tol = 1e-7, maxitr = 100){
  # initialize
  n = nrow(X)
  p = ncol(X)
  nlambda = length(lambda_all)
  lambda_all = sort(lambda_all, decreasing = TRUE)

  # matrix to save the parameters for each lambda value
  beta_mat = matrix(NA, p, nlambda)
  beta_init = rep(0, p)

  # path-wise coordinate descent starting with the largest lambda
  for (l in 1:nlambda){
    # the lambda in the current iteration
    lambda = lambda_all[l]
    myBeta = myLasso(X, y, lambda, beta_init, tol, maxitr)$beta

    # save the estimated beta and use it as the warm start for the next lambda
    beta_mat[, l] = myBeta
    beta_init = myBeta
  }

  return(beta_mat)
}
# fit the model
myfit = myLasso_pw(X, y, lambda_all)
We compare the two solution paths:
par(mfrow = c(1, 2))
# a matplot of the first 8 coefficients vs log scale of lambda
matplot(log(lambda_all), t(glmnetfit$beta[1:8, ]), type = "l", lwd = 2,
xlab = "Log Lambda",ylab = "Estimated Beta", main = "glmnet")
legend("topleft", paste("beta", 1:8, "=", c(1:3, -3:-1, 0, 0)),
col = 1:8, lty = 1:8, lwd = 2)
# a matplot of the first 8 coefficients vs log scale of lambda
matplot(log(lambda_all), t(myfit[1:8, ]), type = "l", lwd = 2,
xlab = "Log Lambda",ylab = "Estimated Beta", main = "My Lasso")
legend("topleft", paste("beta", 1:8, "=", c(1:3, -3:-1, 0, 0)),
col = 1:8, lty = 1:8, lwd = 2)
Based on the plot, \(X_3\) and \(X_4\) enter the model first. They also have the largest true coefficients in magnitude (3 and -3).
beta.diff <- myfit - as.matrix(glmnetfit$beta)
plot(log(lambda_all), colSums(abs(beta.diff)), ylab = "L1 norm, all coefficients: | beta - beta_glm | ")
The plot is presented above. The discrepancy is larger when \(\lambda\) is smaller. This is possibly caused by the accumulation of numerical error when the penalty is small, since many parameters enter the model when \(\lambda\) is small, as can be seen from the figure below.
plot(log(lambda_all), colSums(myfit!=0), ylab = "Number of Nonzero Parameters")
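To recover the coefficients on the original scale, note that both X and y were centered and scaled by their sample standard deviations times the same \(\sqrt{n/(n-1)}\) factor, which cancels between the two sides. The recovery formulas from HW4 therefore reduce to the following, which is what the code below implements:

\[
\hat\beta_j^{\text{org}} = \hat\beta_j^{\text{scaled}} \cdot \frac{\text{sd}(\mathbf{y}_\text{org})}{\text{sd}(\mathbf{x}_{j,\text{org}})}, \qquad
\hat\beta_0^{\text{org}} = \bar{y}_\text{org} - \sum_{j=1}^{p} \hat\beta_j^{\text{org}} \, \bar{x}_{j,\text{org}}.
\]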
# the glmnet solution
glmnetfit2 = glmnet(X_org, y_org,
lambda = lambda_all/2*sd(y_org)*sqrt(n/(n-1)))
lassobeta2 = coef(glmnetfit2)[1:9, ]
lassobeta2.all = coef(glmnetfit2)
# recover the solution using our code
b = sweep(myfit, 1, apply(X_org, 2, sd), FUN = "/")*sd(y_org)
beta0 = mean(y_org) - colSums(sweep(myfit, 1, apply(X_org, 2, sd) / apply(X_org, 2, mean),FUN = "/")*sd(y_org))
coefRecover = rbind(beta0, b)
# plot and compare the solutions
par(mfrow = c(1, 2))
# a matplot of the first 9 coefficients vs log scale of lambda
matplot(log(lambda_all), t(as.matrix(lassobeta2)), type = "l", lwd = 2,
xlab = "Log Lambda",ylab = "Estimated Beta", main = "glmnet")
legend("topleft", paste("beta", 1:9, "=", c(0, 1:3, -3:-1, 0, 0)),
col = 1:9, lty = 1:9, lwd = 2)
# a matplot of the first 8 coefficients vs log scale of lambda
matplot(log(lambda_all), t(coefRecover[1:9, ]), type = "l", lwd = 2,
xlab = "Log Lambda",ylab = "Estimated Beta", main = "My Lasso (recovered)")
legend("topleft", paste("beta", 1:8, "=", c(0, 1:3, -3:-1, 0, 0)),
col = 1:9, lty = 1:9, lwd = 2)
Visually, the two solution paths match almost exactly. We can then calculate the discrepancy:
beta.diff2 <- coefRecover - as.matrix(lassobeta2.all)
max_diff = max(colSums(abs(beta.diff2)))
print(max_diff)
## [1] 0.2883592
The maximum discrepancy across all \(\lambda\) values is 0.2883592.