Please remove this section when submitting your homework.
Students are encouraged to work together on homework and/or utilize advanced AI tools. However, sharing, copying, or providing any part of a homework solution or code to others is an infraction of the University’s rules on Academic Integrity. Any violation will be punished as severely as possible. Final submissions must be uploaded to Gradescope. No email or hard copy will be accepted. For late submission policy and grading rubrics, please refer to the course website.
Name your submission file `HWx_yourNetID.pdf`; for example, `HW01_rqzhu.pdf`. Please note that this must be a `.pdf` file: the `.html` format cannot be accepted. Make all of your R code chunks visible for grading. If you use the `.Rmd` file as a template, be sure to remove this instruction section. Make sure your R version is \(\geq 4.0.0\); this ensures your random seed generation is the same as everyone else's. Please note that updating the R version may require you to reinstall all of your packages.

Here is a GPT-4 prompt I used:
What is the advantage of a Ridge regression? Can you suggest a simulation study to confirm that? Please write your code in R.
The following is the answer produced by GPT-4
Ridge regression is a regularization technique that helps to solve the problem of multicollinearity in linear regression. It can handle high-dimensional data where the number of predictors may exceed the number of observations. Its main advantages are summarized below.
You can simulate the advantage of Ridge regression by comparing its performance with ordinary linear regression on a dataset with multicollinearity. Here’s an R script to conduct this study:
# Load required libraries
library(MASS)
library(glmnet)
# Simulate data with multicollinearity
set.seed(42)  # fix the seed so results are reproducible
n <- 100
p <- 10
X <- matrix(rnorm(n * p), n, p)
X[,2] <- X[,1] + rnorm(n) * 0.01 # Introducing multicollinearity
beta <- runif(p)
y <- X %*% beta + rnorm(n)
# Ordinary Linear Regression
lm_model <- lm(y ~ X)
lm_pred <- fitted(lm_model)  # in-sample predictions on the training data
# Ridge Regression
ridge_model <- glmnet(X, y, alpha=0)
ridge_pred <- predict(ridge_model, s=0.01, newx=X)
# Compare Performance
lm_mse <- mean((y - lm_pred)^2)
ridge_mse <- mean((y - ridge_pred)^2)
cat("MSE of Ordinary Linear Regression:", lm_mse, "\n")
cat("MSE of Ridge Regression:", ridge_mse, "\n")
The script generates a dataset with multicollinearity and then compares the mean squared error (MSE) of ordinary linear regression and ridge regression. Typically, you will observe that ridge regression performs better in this scenario. The advantages, in summary:

* Bias-Variance Trade-off
* Stability
* Including All Predictors
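The "Stability" point can be illustrated with a short base-R sketch (hypothetical data and sample sizes; ridge is computed here from its closed form \((X^\top X + \lambda I)^{-1} X^\top y\) rather than with `glmnet`): across repeated samples with two nearly collinear predictors, the OLS coefficient estimate swings wildly while the ridge estimate stays stable.

```r
set.seed(1)
n <- 100; lambda <- 1
ols_b1 <- ridge_b1 <- numeric(200)
for (i in 1:200) {
  x1 <- rnorm(n)
  x2 <- x1 + 0.01 * rnorm(n)        # nearly collinear predictors
  X  <- cbind(x1, x2)
  y  <- x1 + x2 + rnorm(n)
  ols_b1[i]   <- solve(crossprod(X), crossprod(X, y))[1]    # OLS coefficient of x1
  ridge_b1[i] <- solve(crossprod(X) + lambda * diag(2),
                       crossprod(X, y))[1]                  # ridge coefficient of x1
}
c(ols_sd = sd(ols_b1), ridge_sd = sd(ridge_b1))
```

Typically `sd(ols_b1)` is an order of magnitude larger than `sd(ridge_b1)`: the near-singular direction of \(X^\top X\) inflates the OLS variance, while the \(\lambda I\) term bounds it for ridge.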
* You can borrow similar ideas of simulation we used in previous lecture notes
* Modify the GPT-4 code with the following settings to generate the data:
* Training sample size $trainn = 50$
* Testing sample size $testn = 200$
* $p = 200$
* Fix $b = rep(0.1, p)$ for all simulation runs
* Since linear regression doesn't work in this setting, you only need to consider `glmnet()`
* Use a set of $\lambda$ values `exp(seq(log(0.5), log(0.01), length.out = 100))*trainn`
* Instead of evaluating the bias and variance separately (we will do that in the future), we will __use the testing error as the metric__.
* Demonstrate your result using plots and give a clear explanation of your findings. Particularly, which side of the result displays a large bias, and which side corresponds to a large variance?
Answer:
Based on the requirements, we can modify the code as follows. Specifically, we simulate the data 200 times, and calculate the prediction error of a Ridge regression at each \(\lambda\) value. We then average the results over all simulation runs and plot the results.
# Load required library
library(MASS)
library(glmnet)
## Loading required package: Matrix
## Loaded glmnet 4.1-4
# parameters
set.seed(42)
n = 250
trainn = 50
nsim = 200
p = 200
# set a fixed beta
b = rep(0.1, p)
# set a set of lambda
lambda = exp(seq(log(1), log(0.01), length.out = 100))*trainn
# object to record results
ridge_mse = matrix(NA, nsim, length(lambda))
# start simulation
for (k in 1:nsim)
{
# generate data
X <- matrix(rnorm(n * p), n, p)
X[,2] <- X[,1] + rnorm(n) * 0.01 # Introducing multicollinearity
y <- X %*% b + rnorm(n)
alldata = data.frame(X, y)
traindata = alldata[1:trainn, ]
testdata = alldata[-(1:trainn), ]
# Ridge Regression
ridge_model <- glmnet(data.matrix(traindata[, 1:p]), traindata$y, alpha=0, lambda = lambda)
ridge_pred <- predict(ridge_model, newx=data.matrix(testdata[, 1:p]))
# Compare Performance
ridge_mse[k, ] <- colMeans((sweep(ridge_pred, 1, testdata$y, "-"))^2)
}
# plot the results
plot(lambda, colMeans(ridge_mse), type = "l", col = "red", lwd = 2,
     xlab = "lambda", ylab = "MSE")
This shows that as we change the \(\lambda\) value, the testing error first decreases and then increases, which is consistent with the bias-variance trade-off. When \(\lambda\) is small, the model is more flexible, with a small bias but large variance. When \(\lambda\) is large, the model is less flexible, with a large bias but small variance.
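The two sides of the curve can also be checked directly with a small closed-form sketch (base R, with a fixed hypothetical design and dimensions, not the simulation above): averaging the ridge estimates over repeated noise draws shows the squared bias growing with \(\lambda\), while the sampling variance of the estimates shrinks.

```r
set.seed(1)
n <- 50; p <- 20
b <- rep(0.3, p)
X <- matrix(rnorm(n * p), n, p)     # one fixed design matrix
lambdas <- c(0.1, 10, 100, 1000)
nrep <- 200
est <- array(NA, c(nrep, length(lambdas), p))
for (i in 1:nrep) {
  y <- X %*% b + rnorm(n)           # new noise each replication
  for (j in seq_along(lambdas))
    est[i, j, ] <- solve(crossprod(X) + lambdas[j] * diag(p),
                         crossprod(X, y))
}
# squared bias: distance of the average estimate from the truth (grows with lambda)
bias2 <- apply(est, 2, function(m) sum((colMeans(m) - b)^2))
# variance: total sampling variability of the estimates (shrinks with lambda)
variance <- apply(est, 2, function(m) sum(apply(m, 2, var)))
round(rbind(bias2 = bias2, variance = variance), 4)
```

This matches the plot: the small-\(\lambda\) (left) side of the testing-error curve is dominated by variance, and the large-\(\lambda\) (right) side by bias.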
We will use the `golub` dataset from the `multtest` package. This dataset contains 3051 genes from 38 tumor mRNA samples from the leukemia microarray study of Golub et al. (1999). This package is not included in R, but is available on Bioconductor. Install the latest version of this package from Bioconductor, and read the documentation of this dataset to understand the data structure of `golub` and `golub.cl`.
[25 points] We will not use this data for classification (the original problem). Instead, we will do a toy regression example to show how genes are highly correlated and could be used to predict each other. Carry out the following tasks: perform a marginal association test of each gene with `golub.cl` using `mt.teststat()`, with `t.equalvar` (two-sample \(t\) test with equal variance) as the test statistic.

[5 points] Based on your results, do you observe any bias-variance trade-off? If not, can you explain why?
Answer:
The following code shows how to install the `multtest` package from Bioconductor and load the data. We can then perform the required tasks. First, we use `cv.glmnet()` to perform ridge regression with 19-fold cross-validation.
# To install multtest from Bioconductor (only needed once):
# if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
# BiocManager::install("multtest")
library(glmnet)
library(multtest)
data(golub)
# perform marginal association test for each gene with the response golub.cl
t_stat <- mt.teststat(golub, golub.cl, test="t.equalvar")
# sort the genes by absolute t statistic, largest (most significant) first
t_stat <- sort(abs(t_stat), decreasing=TRUE, index.return=TRUE)
# select the top 100 genes
topgenes <- t_stat$ix[1:100]
# construct the dataset with the top 100 genes and the remaining genes
y0 <- t(golub[topgenes, ])
X <- t(golub[-topgenes, ])
# perform PCA on the top 100 genes and extract the first principal component
y <- prcomp(y0)$x[, 1]
# perform ridge regression with 19-fold cross-validation
cvfit <- cv.glmnet(X, y, alpha=0, nfolds=19)
## Warning: Option grouped=FALSE enforced in cv.glmnet, since < 3 observations per
## fold
cvfit$lambda.1se
## [1] 73.53959
plot(cvfit)
The plot shows the cross-validation error curve; \(\lambda = 73.54\) is the `lambda.1se` choice, the largest \(\lambda\) whose error is within one standard error of the minimum. Next, we fit ridge regression using the `lm.ridge()` function and use GCV as the selection criterion.
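For reference, GCV for a linear smoother approximates leave-one-out cross-validation as \(\mathrm{GCV}(\lambda) = n \cdot \mathrm{RSS}(\lambda) / (n - \mathrm{tr}\, S_\lambda)^2\), where \(S_\lambda = X(X^\top X + \lambda I)^{-1} X^\top\) is the ridge smoother matrix and \(\mathrm{tr}\, S_\lambda\) its effective degrees of freedom. A minimal base-R sketch on small simulated data (this is one common scaling of the formula; `lm.ridge()` centers and scales internally, so its numbers will differ, though the shape of the curve is comparable):

```r
set.seed(1)
n <- 60; p <- 5
X <- matrix(rnorm(n * p), n, p)
y <- X %*% rep(1, p) + rnorm(n)
lams <- c(0.01, 1, 100)
gcv_tab <- sapply(lams, function(lambda) {
  S   <- X %*% solve(crossprod(X) + lambda * diag(p), t(X))  # ridge smoother matrix
  rss <- sum((y - S %*% y)^2)                                # residual sum of squares
  df  <- sum(diag(S))                                        # effective degrees of freedom
  c(df = df, GCV = n * rss / (n - df)^2)
})
colnames(gcv_tab) <- lams
round(gcv_tab, 3)
```

Note that the effective degrees of freedom decrease as \(\lambda\) grows, which is what lets GCV trade goodness of fit against model complexity.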
# fit ridge regression using lm.ridge() and use GCV as the selection criterion
library(MASS)
lambda.seq <- seq(0.01, cvfit$lambda.1se, length.out = 100)
fit <- lm.ridge(y ~ X, lambda = lambda.seq)
plot(lambda.seq, fit$GCV, type = "l", xlab = "lambda", ylab = "GCV")
Both cross-validation and GCV suggest that the model fits well, but without a clear bias-variance trade-off. There are many possible answers to this, but most importantly, the top selected genes are all highly associated with the outcome label `golub.cl`, making their first PC also highly associated with `golub.cl`. This can be seen from the following plot.
plot(prcomp(y0)$x[, 1:2], col=golub.cl + 1, pch=19, xlab="PC1", ylab="PC2")
On the other hand, if we check the first PC of all the other variables, it also aligns well with the class labels. In fact, we can compute the correlation between these two first PCs directly. This alignment is possibly because \(X\) also contains many genes that behave similarly to the top 100 genes. Hence, based on the ridge regression concept, we do not need to penalize the coefficients much, since we already obtain a good fit with small variance; the benefit of a large \(\lambda\) for reducing variance does not outweigh the cost of additional bias. This property does not hold for just any subset of genes: if you use a small random subset, the result will be different. However, if you select a large number of genes, the property will be observed again, since they will likely resemble the same leading PC as the rest of the data.
plot(prcomp(X)$x[, 1:2], col=golub.cl + 1, pch=19, xlab="PC1", ylab="PC2")
cor(prcomp(X)$x[, 1], prcomp(y0)$x[, 1])
## [1] 0.06700149
randomgenes <- sample(1:nrow(golub), 20)
cor(prcomp(t(golub[randomgenes, ]))$x[, 1], prcomp(t(golub[-randomgenes, ]))$x[, 1])
## [1] 0.6518281
randomgenes <- sample(1:nrow(golub), 400)
cor(prcomp(t(golub[randomgenes, ]))$x[, 1], prcomp(t(golub[-randomgenes, ]))$x[, 1])
## [1] 0.9287385