Please remove this section when submitting your homework.

Students are encouraged to work together on homework and/or utilize advanced AI tools. However, there are two basic rules:

* Final submissions must be uploaded to Gradescope. No email or hard copy will be accepted. Please refer to the course website for the late submission policy and grading rubrics.
* Name your file as HWx_yourNetID.pdf; for example, HW01_rqzhu.pdf. Please note that this must be a .pdf file. The .html format will not be accepted because it is often not readable on Gradescope.
* Make all of your R code chunks visible for grading.
* Make sure your R version is \(\geq 4.0.0\). This will ensure your random seed generation is the same as everyone else's.
* If you use this .Rmd file as a template, be sure to remove this instruction section.

During our lecture, we considered a simulation model to analyze the variable selection property of Lasso. Now let’s further investigate the prediction error of both Lasso and Ridge, and understand the bias-variance trade-off. Consider the linear model defined as:
\[ Y = X^\text{T} \boldsymbol \beta + \epsilon \]
Where \(\boldsymbol \beta = (\beta_1, \beta_2, \ldots, \beta_{100})^T\) with \(\beta_1 = \beta_{11} = \beta_{21} = \beta_{31} = 0.4\) and all other \(\beta\) parameters set to zero. The noise term \(\epsilon \sim {\cal N}(0,1)\) is independent of \(X\). The \(p\)-dimensional covariate \(X\) follows a multivariate Gaussian distribution:
\[ \mathbf{X} \sim {\cal N}(\mathbf{0}, \Sigma_{p\times p}). \]
In \(\Sigma\), all diagonal elements are 1, and all off-diagonal elements are \(\rho\).
[10 points] A Single Simulation Run

Generate 200 training observations and 500 testing observations from the model above with \(\rho = 0.1\). Fit Lasso using cv.glmnet() on the training data with 10-fold cross-validation. Use lambda.1se to select the optimal \(\lambda\). Report the prediction error on the testing data and whether the correct model was selected.

Answer:
# Load required packages
library(glmnet)
library(MASS)
# Set random seed for reproducibility
set.seed(432)
# Initialize parameters
p = 100
beta = rep(0, p)
beta[c(1, 11, 21, 31)] = 0.4
rho = 0.1
ntrain = 200
ntest = 500
# Generate covariance matrix Sigma
Sigma = matrix(rho, p, p)
diag(Sigma) = 1
# Generate training and testing data
xtrain = mvrnorm(ntrain, rep(0, p), Sigma)
xtest = mvrnorm(ntest, rep(0, p), Sigma)
ytrain = xtrain %*% beta + rnorm(ntrain)
ytest = xtest %*% beta + rnorm(ntest)
# Fit Lasso regression model using 10-fold CV
lasso.fit = cv.glmnet(xtrain, ytrain, nfolds = 10)
# Make predictions and calculate test error for Lasso
lasso.pred = predict(lasso.fit, xtest, s = "lambda.1se")
lasso.error = mean((ytest - lasso.pred)^2)
# Check if the correct model was selected by Lasso
lasso.beta = as.vector(coef(lasso.fit, s = "lambda.1se"))[-1]
lasso.correct = all((lasso.beta != 0) == (beta != 0))
lasso.error
## [1] 1.19458
lasso.correct
## [1] FALSE
Hence, the Lasso model gives a test prediction error of about 1.195. However, Lasso did not select exactly the correct model in this run (lasso.correct is FALSE).
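As a side check (not required by the question), the same cross-validated fit can also be evaluated at lambda.min, which uses a smaller penalty and therefore keeps at least as many variables as lambda.1se. This is a minimal sketch that reuses lasso.fit, xtest, ytest, and beta from the chunk above.

# Side check: evaluate the same cv.glmnet fit at lambda.min
lasso.pred.min = predict(lasso.fit, xtest, s = "lambda.min")
mean((ytest - lasso.pred.min)^2)            # test error under lambda.min
lasso.beta.min = as.vector(coef(lasso.fit, s = "lambda.min"))[-1]
sum(lasso.beta.min != 0)                    # number of selected variables
all((lasso.beta.min != 0) == (beta != 0))   # exact support recovery?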
[15 points] Higher Correlation and Multiple Simulation Runs

Repeat the simulation above nsim \(= 100\) times for each \(\rho \in \{0.1, 0.3, 0.5, 0.7, 0.9\}\). For each \(\rho\), report the average prediction error and the proportion of runs in which Lasso selected exactly the correct model.

Answer:
set.seed(432)
nsim <- 100
rho_all <- c(0.1, 0.3, 0.5, 0.7, 0.9)
error <- numeric(length(rho_all))
correct <- numeric(length(rho_all)) # numeric, not logical
for (i in seq_along(rho_all)) {
  rho <- rho_all[i]
  Sigma <- matrix(rho, p, p); diag(Sigma) <- 1
  error_all <- numeric(nsim)
  correct_all <- logical(nsim)
  for (j in 1:nsim) {
    xtrain <- mvrnorm(ntrain, rep(0, p), Sigma)
    xtest <- mvrnorm(ntest, rep(0, p), Sigma)
    ytrain <- xtrain %*% beta + rnorm(ntrain)
    ytest <- xtest %*% beta + rnorm(ntest)
    lasso.fit <- cv.glmnet(xtrain, ytrain, nfolds = 10, alpha = 1,
                           family = "gaussian", type.measure = "mse")
    lasso.pred <- predict(lasso.fit, xtest, s = "lambda.1se")
    error_all[j] <- mean((ytest - lasso.pred)^2)
    lbeta <- as.vector(coef(lasso.fit, s = "lambda.1se"))[-1]
    correct_all[j] <- all((lbeta != 0) == (beta != 0))
  }
  error[i] <- mean(error_all)
  correct[i] <- mean(correct_all) # proportion in [0,1]
}
error
## [1] 1.178422 1.182164 1.175147 1.159814 1.160155
correct
## [1] 0.32 0.07 0.01 0.00 0.00
As \(\rho\) increases, the proportion of runs that recover the correct model decreases. This is mainly because Lasso can be unstable in the presence of correlated variables: selecting a true variable or one of its highly correlated neighbors leads to essentially the same loss, so Lasso may falsely select a correlated variable instead of the original one. However, this does not affect the prediction error much; the average prediction error remains relatively stable across different values of \(\rho\), mainly because the correlated variables carry similar predictive power.
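To illustrate the first point, the following minimal sketch compares the in-sample fit of the true support \(\{1, 11, 21, 31\}\) with a "wrong" support that swaps \(X_1\) for its highly correlated neighbor \(X_2\) when \(\rho = 0.9\). It reuses p, beta, ntrain, and MASS::mvrnorm from above; the seed and the choice of \(X_2\) as the substitute are only for illustration.

# With rho = 0.9, substituting a correlated variable changes the fit very little
set.seed(1)
rho = 0.9
Sigma = matrix(rho, p, p)
diag(Sigma) = 1
x = mvrnorm(ntrain, rep(0, p), Sigma)
y = x %*% beta + rnorm(ntrain)
fit_true = lm(y ~ x[, c(1, 11, 21, 31)])    # true support
fit_wrong = lm(y ~ x[, c(2, 11, 21, 31)])   # X2 substituted for X1
c(true = mean(resid(fit_true)^2), wrong = mean(resid(fit_wrong)^2))

The two training mean squared errors are typically very close, which is why cross-validation has little reason to prefer the true support.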
[15 points] Ridge Regression

Repeat the previous question with Ridge regression (alpha = 0 in cv.glmnet(), still using lambda.1se), and report the average prediction error for each \(\rho\).

Answer:
nsim = 100
rho_all = c(0.1, 0.3, 0.5, 0.7, 0.9)
error = rep(0, 5)
for (i in 1:length(rho_all)) {
  rho = rho_all[i]
  Sigma = matrix(rho, p, p)
  diag(Sigma) = 1
  error_all = rep(0, nsim)
  for (j in 1:nsim) {
    xtrain = mvrnorm(ntrain, rep(0, p), Sigma)
    xtest = mvrnorm(ntest, rep(0, p), Sigma)
    ytrain = xtrain %*% beta + rnorm(ntrain)
    ytest = xtest %*% beta + rnorm(ntest)
    ridge.fit = cv.glmnet(xtrain, ytrain, nfolds = 10, alpha = 0)
    ridge.pred = predict(ridge.fit, xtest, s = "lambda.1se")
    error_all[j] = mean((ytest - ridge.pred)^2)
  }
  # record the average prediction error for this rho
  error[i] = mean(error_all)
}
error
## [1] 1.459775 1.389741 1.336244 1.271737 1.168856
The average prediction error for ridge regression decreases as the correlation increases. This is mainly because ridge regression exploits the correlation between variables to improve prediction accuracy. Ridge regression can be interpreted as performing PCA on the design matrix, projecting the response onto the principal components, and then shrinking the coefficients of those projections, with stronger shrinkage in low-variance directions. As the correlation increases, the leading principal components capture more of the variation, and the true signal is largely aligned with the first principal direction, so the prediction accuracy improves.
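The PCA view above can be made concrete through the SVD. Writing the centered design as \(X = U D V^\text{T}\), the ridge fitted values shrink the projection of \(\mathbf y\) onto the \(j\)-th principal direction by the factor \(d_j^2 / (d_j^2 + \lambda)\). Below is a minimal sketch of this identity for the textbook ridge estimator \((X^\text{T} X + \lambda I)^{-1} X^\text{T} \mathbf y\) (not glmnet's internally standardized version); the value \(\lambda = 5\) and the seed are arbitrary, and p, beta, and mvrnorm are reused from above.

# Ridge through the SVD: each principal direction is shrunk by d_j^2 / (d_j^2 + lambda)
set.seed(2)
lambda = 5
Sigma9 = matrix(0.9, p, p)
diag(Sigma9) = 1
X = scale(mvrnorm(200, rep(0, p), Sigma9), scale = FALSE)   # centered design
y = X %*% beta + rnorm(200)
y = y - mean(y)
sv = svd(X)
shrink = sv$d^2 / (sv$d^2 + lambda)                # per-direction shrinkage factors
yhat_svd = sv$u %*% (shrink * crossprod(sv$u, y))  # U diag(shrink) U^T y
beta_ridge = solve(crossprod(X) + lambda * diag(p), crossprod(X, y))
max(abs(yhat_svd - X %*% beta_ridge))              # should be near zero: the two forms agree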
During our lecture, we considered a simulation model to analyze the variable selection property of Lasso. Now let’s further investigate the prediction error caused by the \(L_1\) penalty under this model, and understand the bias-variance trade-off. For this question, your underlying true data generating model should be

\[\begin{align} Y &= X^\text{T} \boldsymbol \beta + \epsilon \\ &= \sum_{j = 1}^p 0.4^{\sqrt{j}} X_j + \epsilon, \end{align}\]

where \(p = 30\), each \(X_j\) is generated independently from \({\cal N}(0, 1)\), and \(\epsilon\) also follows a standard normal distribution, independent of \(X\). The goal is to predict two target points and investigate how the prediction error changes under different penalties. The training data and two target testing points are defined by the following code.
# target testing points
p = 30
xa = xb = rep(0, p)
xa[2] = 1
xb[10] = 1
Perform the following:

* Use the sequence of \(\lambda\) values exp(seq(-5, 5, 0.05)).
* Perform nsim \(= 200\) independent simulation runs, with n \(= 100\) observations in each run.
* In each run, use the glmnet() function to fit Lasso on the \(\lambda\) values, predict both target points, and record the squared prediction errors; average the errors over all runs.
* You should observe that the optimal \(\lambda\) for xb is much larger than that for xa. What are the corresponding best \(\lambda\) and error for each target point?
* Explain why the optimal \(\lambda\) for xb is larger than that for xa. Hint: pay attention to their covariate values and the associated \(\widehat \beta\) parameters. Discuss how their predictions would trade bias and variance differently.

Answer:
set.seed(1)
library(glmnet)
n = 100
p = 30
all_lambda = exp(seq(-5, 5, 0.05))
nsim = 200
coefs = 0.4^sqrt(c(1:p))
# testing data
xtest = rbind(xa, xb)
# store the errors
errors = array(NA, dim = c(2, length(all_lambda), nsim))
# repeat the simulation
for (i in 1:nsim){
  # generate the training design matrix and outcome
  x = matrix(rnorm(n*p), n, p)
  y = x %*% coefs + rnorm(n)
  # true (noiseless) outcome at the two target points
  ytest = as.vector(xtest %*% coefs)
  # fit the lasso path and record squared prediction errors at each lambda
  glmnetfit = glmnet(x, y, lambda = all_lambda)
  errors[,,i] = (predict(glmnetfit, xtest, s = all_lambda) - ytest)^2
}
# average the errors over all simulation runs
mean_errors = apply(errors, c(1, 2), mean)
xa_error = mean_errors[1, ]
xb_error = mean_errors[2, ]
# plot the errors
plot(log(all_lambda), mean_errors[1, ], type = "l", xlab = "log(lambda)",
     ylab = "error", ylim = c(0, 0.1))
points(log(all_lambda), mean_errors[2, ], col = "red", type = "l")
legend("topleft", legend = c("xa", "xb"), col = c("black", "red"), lty = 1)
# optimal lambda and error
lambda_a = all_lambda[which.min(xa_error)]
best_err_a = min(xa_error)
lambda_b = all_lambda[which.min(xb_error)]
best_err_b = min(xb_error)
lambda_err = matrix(c(lambda_a, lambda_b, best_err_a, best_err_b), 2, 2)
rownames(lambda_err) = c("xa", "xb")
colnames(lambda_err) = c("optimal lambda", "optimal error")
knitr::kable(data.frame(lambda_err), table.attr = "style='width:40%;'",
format = "html", digits = 5)
|    | optimal.lambda | optimal.error |
|----|---------------:|--------------:|
| xa | 0.02872 | 0.02193 |
| xb | 0.10026 | 0.01161 |
The prediction at xa relies heavily on the estimate of a relatively large parameter (the second entry, \(\beta_2 = 0.4^{\sqrt{2}} \approx 0.27\)), so even a small amount of penalty introduces a relatively large bias. In contrast, the prediction at xb depends only on a small parameter (the tenth entry, \(\beta_{10} = 0.4^{\sqrt{10}} \approx 0.06\)), and applying a large penalty simply sets its estimate to zero. Hence a larger penalty mostly reduces the variance further without introducing much additional bias. This can also be seen in the figure: the maximum error for xa is much larger than for xb, because there is a lot of bias to introduce when a large penalty is used. It is therefore preferable to use a large \(\lambda\) when predicting xb and a smaller \(\lambda\) for xa to better balance the bias-variance trade-off.
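The different bias costs can also be read directly off the coefficient paths of \(\widehat\beta_2\) (used by xa) and \(\widehat\beta_{10}\) (used by xb). Below is a minimal sketch on one simulated dataset, reusing n, p, coefs, and all_lambda from above; the seed is arbitrary and the dashed lines mark the true values \(0.4^{\sqrt{2}} \approx 0.27\) and \(0.4^{\sqrt{10}} \approx 0.06\).

# Coefficient paths of beta_2 and beta_10 along the lambda grid for one dataset
set.seed(3)
x = matrix(rnorm(n * p), n, p)
y = x %*% coefs + rnorm(n)
fit = glmnet(x, y, lambda = all_lambda)
beta_path = as.matrix(coef(fit))[-1, ]   # p x nlambda matrix, intercept dropped
matplot(log(fit$lambda), t(beta_path[c(2, 10), ]), type = "l", lty = 1,
        col = c("black", "red"), xlab = "log(lambda)", ylab = "coefficient estimate")
abline(h = coefs[c(2, 10)], lty = 2, col = c("black", "red"))
legend("topright", legend = c("beta_2", "beta_10"), col = c("black", "red"), lty = 1)

The estimate of \(\beta_2\) is pulled away from a sizable true value almost as soon as the penalty kicks in, which is exactly the bias that hurts the prediction at xa, while \(\beta_{10}\) is small to begin with, so shrinking it all the way to zero costs very little.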
In this question, we will predict the number of applications received using the variables in the College dataset, which can be found in the ISLR2 package. The outcome variable is the number of applications (Apps) and the other variables are predictors. If you use Python, consider migrating the data to an Excel file and reading it in Python.
Split the data into training and testing sets using the code below. Fit a linear model with lm(), and report the test error (i.e., testing MSE).

library(ISLR2)
##
## Attaching package: 'ISLR2'
## The following object is masked from 'package:MASS':
##
## Boston
data(College)
# generate the indices for the testing data
set.seed(7)
test_idx = sample(nrow(College), 177)
train = College[-test_idx,]
test = College[test_idx,]
Answer:
model_linear <- lm(Apps ~ ., data = train)
summary(model_linear)
##
## Call:
## lm(formula = Apps ~ ., data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3644.2 -439.4 -23.2 331.8 6997.3
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -421.53187 479.06765 -0.880 0.379276
## PrivateYes -481.17081 165.98099 -2.899 0.003885 **
## Accept 1.63866 0.04557 35.959 < 2e-16 ***
## Enroll -1.01396 0.23640 -4.289 2.10e-05 ***
## Top10perc 58.07527 6.79021 8.553 < 2e-16 ***
## Top25perc -19.63197 5.35400 -3.667 0.000268 ***
## F.Undergrad 0.06800 0.04301 1.581 0.114440
## P.Undergrad 0.01053 0.04782 0.220 0.825724
## Outstate -0.08698 0.02271 -3.830 0.000142 ***
## Room.Board 0.14536 0.05889 2.468 0.013859 *
## Books 0.11725 0.26326 0.445 0.656209
## Personal 0.01811 0.07241 0.250 0.802545
## PhD -7.90837 5.21170 -1.517 0.129702
## Terminal -5.47538 5.85494 -0.935 0.350087
## S.F.Ratio 18.01171 15.42608 1.168 0.243441
## perc.alumni 0.31887 4.77088 0.067 0.946734
## Expend 0.07428 0.01409 5.271 1.92e-07 ***
## Grad.Rate 11.10438 3.45399 3.215 0.001377 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1067 on 582 degrees of freedom
## Multiple R-squared: 0.9264, Adjusted R-squared: 0.9242
## F-statistic: 430.8 on 17 and 582 DF, p-value: < 2.2e-16
ols_pred <- predict(model_linear, test)
ols_mse <- mean((ols_pred - test$Apps)^2)
ols_mse
## [1] 961142.3
The testing MSE for the linear model is about \(9.611 \times 10^{5}\).
Fit Ridge and Lasso regression on the training data using cv.glmnet(). Use both lambda.min and lambda.1se to select the optimal \(\lambda\) for each method, and report the corresponding test errors.

Answer:
train_data = data.matrix(train)
test_data = data.matrix(test)
Fitting the models
ridge.fit <- cv.glmnet(y = train_data[, 2], x = train_data[, -2],
alpha = 0, standardize = TRUE)
lasso.fit <- cv.glmnet(y = train_data[, 2], x = train_data[, -2],
alpha = 1, standardize = TRUE)
Compute the prediction errors on the testing data:
ridge_mse1 <- mean((predict(ridge.fit, s = "lambda.min", newx = test_data[, -2]) - test_data[, 2])^2)
ridge_mse2 <- mean((predict(ridge.fit, s = "lambda.1se", newx = test_data[, -2]) - test_data[, 2])^2)
lasso_mse1 <- mean((predict(lasso.fit, s = "lambda.min", newx = test_data[, -2]) - test_data[, 2])^2)
lasso_mse2 <- mean((predict(lasso.fit, s = "lambda.1se", newx = test_data[, -2]) - test_data[, 2])^2)
cbind(c("Ridge (lambda.min)", "Ridge (lambda.1se)", "Lasso (lambda.min)", "Lasso (lambda.1se)"),
round(c(ridge_mse1, ridge_mse2, lasso_mse1, lasso_mse2), 2))
## [,1] [,2]
## [1,] "Ridge (lambda.min)" "875548.88"
## [2,] "Ridge (lambda.1se)" "1399187.51"
## [3,] "Lasso (lambda.min)" "908340.33"
## [4,] "Lasso (lambda.1se)" "1076205.51"
For this particular run, Ridge regression with
lambda.min
gives the lowest prediction error.
The glmnet package implemented a new feature called relaxed fits and the associated tuning parameter gamma. You can find some brief explanation of this feature in the documentation of this package. See:

* [CRAN Documentation](https://cran.r-project.org/web/packages/glmnet/glmnet.pdf)
* [glmnet Vignette](https://glmnet.stanford.edu/articles/glmnet.html)
Read these documentations regarding the `gamma` parameter, and summarize the idea of this feature in terms of the loss function being used. You need to write it specifically in terms of the data vectors $\mathbf y$ and matrix $\mathbf X$ and define any notations you need. Only consider the Lasso penalty for this question.
After this, implement this feature and utilize the cross-validation to find the optimal $\lambda$ and $\gamma$ for the College dataset. Report the test error for the optimal model.
Answer:
The relaxed fits feature in the glmnet package blends two fits: the original Lasso fit and an unpenalized (OLS) refit on the variables selected by the Lasso. Specifically, let \(\mathbf y \in \mathbb{R}^n\) be the response vector and \(\mathbf X \in \mathbb{R}^{n \times p}\) the design matrix with rows \(\mathbf x_i^\text{T}\). For a fixed \(\lambda\), the Lasso solution is

\[ \widehat{\boldsymbol \beta}_\text{lasso}(\lambda) = \arg\min_{\boldsymbol \beta} \ \frac{1}{2n} \sum_{i=1}^n (y_i - \mathbf x_i^\text{T} \boldsymbol \beta)^2 + \lambda \sum_{j=1}^p |\beta_j|, \]

and \(\widehat{\boldsymbol \beta}_\text{sub}(\lambda)\) denotes the OLS estimator that minimizes \(\frac{1}{2n} \sum_{i=1}^n (y_i - \mathbf x_i^\text{T} \boldsymbol \beta)^2\) subject to \(\beta_j = 0\) for every variable with a zero Lasso coefficient at this \(\lambda\). For a mixing parameter \(\gamma \in [0, 1]\), the relaxed fit reported is the blend

\[ \widehat{\boldsymbol \beta}(\lambda, \gamma) = \gamma \, \widehat{\boldsymbol \beta}_\text{lasso}(\lambda) + (1 - \gamma) \, \widehat{\boldsymbol \beta}_\text{sub}(\lambda), \]

so \(\gamma = 1\) recovers the usual Lasso fit and \(\gamma = 0\) gives the fully relaxed (unpenalized refit) solution. To implement this feature, we need to set relax = TRUE in the cv.glmnet() function.
relaxed_fit <- cv.glmnet(y = train_data[, 2], x = train_data[, -2], relax = TRUE, gamma = seq(0, 1, 0.2))
relaxed_mse <- mean((predict(relaxed_fit, s = "lambda.min", newx = test_data[, -2]) - test_data[, 2])^2)
relaxed_mse
## [1] 948102.7
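As a sanity check on the \(\gamma = 0\) end of the blend described above, the fully relaxed fit can be computed by hand: take the active set of the Lasso at lambda.min and refit an unpenalized model on just those variables. This is only an illustrative sketch; it reuses the College objects train, test, train_data, and the lasso.fit from the earlier chunk, and the numbers need not match the relaxed cross-validated choice exactly.

# gamma = 0 "by hand": OLS refit on the variables selected by the Lasso at lambda.min
active = which(as.vector(coef(lasso.fit, s = "lambda.min"))[-1] != 0)
active_vars = colnames(train_data[, -2])[active]
refit = lm(Apps ~ ., data = train[, c("Apps", active_vars)])
refit_mse = mean((predict(refit, test) - test$Apps)^2)
refit_mse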
In HW3, we used the golub dataset from the multtest package. This dataset contains 3051 genes from 38 tumor mRNA samples from the leukemia microarray study of Golub et al. (1999). The outcome golub.cl is an indicator for two leukemia types: Acute Lymphoblastic Leukemia (ALL) or Acute Myeloid Leukemia (AML). In genetic analysis, many gene expressions are highly correlated. Hence we could consider the Elastic net model for both sparsity and correlation.
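For reference, the elastic net penalty used by glmnet combines the Ridge and Lasso penalties. With mixing parameter \(\alpha \in [0, 1]\), glmnet minimizes

\[ \frac{1}{n} \sum_{i=1}^n \ell\big(y_i, \beta_0 + \mathbf x_i^\text{T} \boldsymbol \beta\big) + \lambda \left[ \frac{1 - \alpha}{2} \|\boldsymbol \beta\|_2^2 + \alpha \|\boldsymbol \beta\|_1 \right], \]

where \(\ell\) denotes the negative log-likelihood contribution of observation \(i\) (the binomial deviance for this classification problem). Setting \(\alpha = 1\) gives the Lasso, \(\alpha = 0\) gives Ridge, and intermediate values trade off sparsity against the grouping of correlated variables.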
if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("multtest")
Fit logistic regression to this dataset. Use a grid of \(\alpha\) values in \([0, 1]\) and report the best \(\alpha\) and \(\lambda\) values using 19-fold cross-validation.
Solution:
# Load required packages
library(glmnet)
library(multtest)
## Loading required package: BiocGenerics
##
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:stats':
##
## IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
##
## anyDuplicated, aperm, append, as.data.frame, basename, cbind,
## colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find,
## get, grep, grepl, intersect, is.unsorted, lapply, Map, mapply,
## match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
## Position, rank, rbind, Reduce, rownames, sapply, setdiff, table,
## tapply, union, unique, unsplit, which.max, which.min
## Loading required package: Biobase
## Welcome to Bioconductor
##
## Vignettes contain introductory material; view with
## 'browseVignettes()'. To cite Bioconductor, see
## 'citation("Biobase")', and for packages 'citation("pkgname")'.
# Load Golub leukemia data
data(golub)
# Extract expression data (transposed for glmnet) and class labels
X = t(as.matrix(golub))
y = as.factor(golub.cl)
# Initialize variables to keep track of results
best_alpha = NULL
best_lambda = NULL
min_cv_error = Inf
# List of alpha values to try
alpha_values = seq(0, 1, by=0.1)
# Loop over alpha values
for (alpha in alpha_values) {
  # Fit elastic net model with current alpha value using cv.glmnet() and 19 folds
  cvfit = cv.glmnet(X, y, family = "binomial", alpha = alpha, nfolds = 19)
  # Retrieve the mean cross-validated error at the optimal lambda
  cv_error = min(cvfit$cvm)
  # Update best_alpha and best_lambda if this model beats the previous best
  if (cv_error < min_cv_error) {
    best_alpha = alpha
    best_lambda = cvfit$lambda.min
    min_cv_error = cv_error
    best_model = cvfit
  }
}
## Warning in lognet(x, is.sparse, y, weights, offset, alpha, nobs, nvars, : one
## multinomial or binomial class has fewer than 8 observations; dangerous ground
# Output the best alpha and lambda
cat("Best alpha:", best_alpha, "\n")
## Best alpha: 0.2
cat("Best lambda:", best_lambda, "\n")
## Best lambda: 0.01957254
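One caveat with the loop above: each call to cv.glmnet() draws a new random fold assignment, so the cross-validation errors are not computed on identical folds across \(\alpha\) values. The glmnet vignette suggests fixing foldid when comparing \(\alpha\) values; below is a minimal sketch of that variant, reusing X, y, and alpha_values from above (the seed is arbitrary).

# Compare alpha values on a common 19-fold assignment
set.seed(1)
foldid = sample(rep(1:19, length.out = nrow(X)))
cv_list = lapply(alpha_values, function(a)
  cv.glmnet(X, y, family = "binomial", alpha = a, foldid = foldid))
cv_min = sapply(cv_list, function(fit) min(fit$cvm))
best = which.min(cv_min)
c(alpha = alpha_values[best], lambda = cv_list[[best]]$lambda.min)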