Please remove this section when submitting your homework.

Students are encouraged to work together on homework and/or utilize advanced AI tools. However, there are two basic rules:

- Final submissions must be uploaded to Gradescope. No email or hard copy will be accepted. Please refer to the course website for the late submission policy and grading rubrics.
- Name your file HWx_yourNetID.pdf. For example, HW01_rqzhu.pdf. Please note that this must be a .pdf file; .html format will not be accepted because it is often not readable on Gradescope.
- Make all of your R code chunks visible for grading.
- Make sure your R version is \(\geq 4.0.0\). This will ensure your random seed generation is the same as everyone else's.
- If you use the .Rmd file as a template, be sure to remove this instruction section.

During our lecture, we considered a simulation model to analyze the variable selection property of Lasso. Now let's further investigate the prediction error of both Lasso and Ridge, and understand the bias-variance trade-off. Consider the linear model defined as:
\[ Y = X^\text{T} \boldsymbol \beta + \epsilon \]
Where \(\boldsymbol \beta = (\beta_1, \beta_2, \ldots, \beta_{100})^T\) with \(\beta_1 = \beta_{11} = \beta_{21} = \beta_{31} = 0.4\) and all other \(\beta\) parameters set to zero. The noise term \(\epsilon \sim {\cal N}(0,1)\) is independent of \(X\). The \(p\)-dimensional covariate \(X\) follows a multivariate Gaussian distribution:
\[ \mathbf{X} \sim {\cal N}(\mathbf{0}, \Sigma_{p\times p}). \]
In \(\Sigma\), all diagonal elements are 1, and all off-diagonal elements are \(\rho\).
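For concreteness, here is a minimal sketch of generating data from this design; the sample size n, the correlation rho, and the seed below are placeholder choices of mine, not values specified by the question.

library(MASS)   # for mvrnorm()

p = 100
n = 200      # placeholder sample size
rho = 0.3    # placeholder correlation

# compound-symmetric covariance: 1 on the diagonal, rho off-diagonal
Sigma = matrix(rho, p, p)
diag(Sigma) = 1

# true coefficients: beta_1 = beta_11 = beta_21 = beta_31 = 0.4
beta = rep(0, p)
beta[c(1, 11, 21, 31)] = 0.4

set.seed(1)
X = mvrnorm(n, mu = rep(0, p), Sigma = Sigma)
y = X %*% beta + rnorm(n)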
[10 points] A Single Simulation Run
Use cv.glmnet() on the training data with 10-fold cross-validation. Use lambda.1se to select the optimal \(\lambda\).
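A minimal sketch of this step, assuming X and y were generated as in the sketch above:

library(glmnet)

set.seed(1)
cvfit = cv.glmnet(X, y, nfolds = 10)

# lambda.1se: the largest lambda whose CV error is within one
# standard error of the minimum CV error
cvfit$lambda.1se
coef(cvfit, s = "lambda.1se")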
[15 points] Higher Correlation and Multiple Simulation Runs
[15 points] Ridge Regression
During our lecture, we considered a simulation model to analyze the variable selection property of Lasso. Now let's further investigate the prediction error caused by the \(L1\) penalty under this model, and understand the bias-variance trade-off. For this question, your underlying true data generating model should be
\[\begin{align} Y &= X^\text{T} \boldsymbol \beta + \epsilon \\ &= \sum_{j = 1}^p 0.4^{\sqrt{j}} X_j + \epsilon, \end{align}\]
where \(p = 30\), each \(X_j\) is generated independently from \({\cal N}(0, 1)\), and \(\epsilon\) also follows a standard normal distribution, independent of \(X\). The goal is to predict two target points and investigate how the prediction error changes under different penalties. The training data and two target testing points are defined by the following code.
# target testing points
p = 30
xa = xb = rep(0, p)
xa[2] = 1
xb[10] = 1
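The chunk above sets up the target points; here is a matching sketch of the training-data generation under the model stated earlier (the object names X and y, and the seed, are my own choices):

n = 100
beta = 0.4^sqrt(1:p)   # true coefficients: beta_j = 0.4^sqrt(j)

set.seed(1)
X = matrix(rnorm(n * p), n, p)   # each X_j iid N(0, 1)
y = X %*% beta + rnorm(n)        # standard normal noise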
Answer the following questions:
- Use a grid of \(\lambda\) values exp(seq(-5, 5, 0.05)).
- Perform nsim \(= 200\) independent runs, with n \(= 100\) observations in each run.
- Use the glmnet() function to fit Lasso on the \(\lambda\) values (a simulation sketch is given after this list).
- You should observe that the best \(\lambda\) for xb is much larger than the best \(\lambda\) for xa. What are the corresponding best \(\lambda\) and prediction error for each target point?
- Explain why the best \(\lambda\) for xb is larger than that for xa. Hint: pay attention to their covariate values and the associated \(\widehat\beta\) parameters. Discuss how their predictions would trade bias and variance differently.
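As referenced in the list above, here is a minimal sketch of one way to organize this simulation. The seed, measuring error against the true mean (rather than a noisy observation), and the decreasing ordering of the \(\lambda\) grid (which matches how glmnet() stores its solutions) are all my own choices.

library(glmnet)

# decreasing lambda grid, matching the order glmnet() uses internally
lambda_grid = exp(seq(5, -5, by = -0.05))
nsim = 200
n = 100
beta = 0.4^sqrt(1:p)

# true means at the two target points (no noise)
mu_a = sum(xa * beta)
mu_b = sum(xb * beta)

pred_a = pred_b = matrix(NA, nsim, length(lambda_grid))

set.seed(1)
for (i in 1:nsim) {
  X = matrix(rnorm(n * p), n, p)
  y = X %*% beta + rnorm(n)
  fit = glmnet(X, y, lambda = lambda_grid)
  pred_a[i, ] = predict(fit, newx = matrix(xa, 1, p))
  pred_b[i, ] = predict(fit, newx = matrix(xb, 1, p))
}

# squared prediction error at each lambda, averaged over the runs
err_a = colMeans((pred_a - mu_a)^2)
err_b = colMeans((pred_b - mu_b)^2)

lambda_grid[which.min(err_a)]   # best lambda for xa
lambda_grid[which.min(err_b)]   # best lambda for xb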
In this question, we will predict the number of applications received using the variables in the College dataset, which can be found in the ISLR2 package. The outcome variable is the number of applications (Apps), and the other variables are predictors. If you use Python, consider exporting the data to an Excel file and reading it into Python.
Fit a linear regression model on the training set using lm(), and report the test error (i.e., testing MSE).

library(ISLR2)
data(College)
# generate the indices for the testing data
set.seed(7)
test_idx = sample(nrow(College), 177)
train = College[-test_idx,]
test = College[test_idx,]
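A minimal sketch of this step, using the train/test split above:

# full linear model on the training set
lmfit = lm(Apps ~ ., data = train)

# testing MSE
lm_pred = predict(lmfit, newdata = test)
mean((test$Apps - lm_pred)^2)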
[5 pts] Compare Lasso and Ridge regression on this problem. Train the models using cross-validation on the training set. Report the test error for both Lasso and Ridge regression. Use both lambda.min and lambda.1se to select the optimal \(\lambda\) for each method.
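A minimal sketch, assuming the train/test split above; the seed is my own choice:

library(glmnet)

# glmnet() needs a numeric design matrix; drop the intercept column
x_train = model.matrix(Apps ~ ., data = train)[, -1]
x_test  = model.matrix(Apps ~ ., data = test)[, -1]

set.seed(7)
lasso_cv = cv.glmnet(x_train, train$Apps, alpha = 1)   # Lasso
ridge_cv = cv.glmnet(x_train, train$Apps, alpha = 0)   # Ridge

# test MSE under both lambda choices for each method
for (s in c("lambda.min", "lambda.1se")) {
  cat(s,
      "Lasso:", mean((test$Apps - predict(lasso_cv, x_test, s = s))^2),
      "Ridge:", mean((test$Apps - predict(ridge_cv, x_test, s = s))^2),
      "\n")
}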
[15 pts] The glmnet package implemented a new feature called relaxed fits, with an associated tuning parameter gamma. You can find a brief explanation of this feature in the documentation of this package. Read the documentation regarding the gamma parameter, and summarize the idea of this feature in terms of the loss function being used. You need to write it specifically in terms of the data vector \(\mathbf y\) and matrix \(\mathbf X\), and define any notation you need. Only consider the Lasso penalty for this question.
After this, implement this feature and utilize cross-validation to find the optimal \(\lambda\) and \(\gamma\) for the College dataset. Report the test error for the optimal model.
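A minimal sketch of the relaxed fit, reusing x_train and x_test from the previous part. cv.glmnet() with relax = TRUE cross-validates over \(\lambda\) and \(\gamma\) jointly; the gamma grid shown is the package default, and the seed is my own choice.

library(glmnet)

set.seed(7)
relax_cv = cv.glmnet(x_train, train$Apps, alpha = 1, relax = TRUE,
                     gamma = c(0, 0.25, 0.5, 0.75, 1))

# optimal (lambda, gamma) pair selected by cross-validation
relax_cv$relaxed$lambda.min
relax_cv$relaxed$gamma.min

# test error of the optimal relaxed model
relax_pred = predict(relax_cv, x_test, s = "lambda.min", gamma = "gamma.min")
mean((test$Apps - relax_pred)^2)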
In HW3, we used the golub dataset from the multtest package. This dataset contains 3051 genes from 38 tumor mRNA samples from the leukemia microarray study of Golub et al. (1999). The outcome golub.cl is an indicator for two leukemia types: Acute Lymphoblastic Leukemia (ALL) or Acute Myeloid Leukemia (AML). In genetic analysis, many gene expressions are highly correlated. Hence, we could consider the Elastic-net model for both sparsity and correlation.
if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("multtest")
Fit an Elastic-net logistic regression to this dataset. Use a grid of \(\alpha\) values in \([0, 1]\) and report the best \(\alpha\) and \(\lambda\) values using 19-fold cross-validation.
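A minimal sketch: since cv.glmnet() tunes only \(\lambda\), the \(\alpha\) grid is handled by an outer loop with a fixed fold assignment, so that all \(\alpha\) values are compared on the same folds. The seed and the 0.1 grid spacing are my own choices.

library(multtest)
library(glmnet)

data(golub)
x = t(golub)     # 38 samples in rows, 3051 genes in columns
y = golub.cl     # 0/1 indicator for ALL vs. AML

# fixed fold assignment: 19 folds of 2 samples each
set.seed(1)
foldid = sample(rep(1:19, length.out = length(y)))

alphas = seq(0, 1, by = 0.1)
cv_err = sapply(alphas, function(a) {
  fit = cv.glmnet(x, y, family = "binomial", alpha = a, foldid = foldid)
  min(fit$cvm)
})

# refit at the best alpha and report the selected lambda
best_alpha = alphas[which.min(cv_err)]
best_fit = cv.glmnet(x, y, family = "binomial",
                     alpha = best_alpha, foldid = foldid)
best_alpha
best_fit$lambda.min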