Please remove this section when submitting your homework.
Students are encouraged to work together on homework and/or utilize advanced AI tools. However, sharing, copying, or providing any part of a homework solution or code to others is an infraction of the University’s rules on Academic Integrity. Any violation will be punished as severely as possible. Final submissions must be uploaded to Gradescope. No email or hard copy will be accepted. For late submission policy and grading rubrics, please refer to the course website.
Your submission must be a single PDF file named HWx_yourNetID.pdf. For example, HW01_rqzhu.pdf. Please note that this must be a .pdf file; .html format cannot be accepted. Make all of your R code chunks visible for grading.
If you use the .Rmd file as a template, be sure to remove this instruction section.
Make sure your version of R is \(\geq 4.0.0\). This will ensure your random seed generation is the same as everyone else's. Please note that updating the R version may require you to reinstall all of your packages.
Here is a GPT-4 prompt I used:
What is the advantage of a Ridge regression? Can you suggest a simulation study to confirm that? Please write your code in R.
The following is the answer produced by GPT-4:
Ridge regression is a regularization technique that helps to solve the problem of multicollinearity in linear regression. It can handle high-dimensional data where the number of predictors may exceed the number of observations. The advantages include:
You can simulate the advantage of Ridge regression by comparing its performance with ordinary linear regression on a dataset with multicollinearity. Here’s an R script to conduct this study:
# Load required library
library(MASS)
library(glmnet)
# Simulate data with multicollinearity
set.seed(42)
n <- 100
p <- 10
X <- matrix(rnorm(n * p), n, p)
X[,2] <- X[,1] + rnorm(n) * 0.01 # Introducing multicollinearity
beta <- runif(p)
y <- X %*% beta + rnorm(n)
# Ordinary Linear Regression
lm_model <- lm(y ~ X)
lm_pred <- predict(lm_model, newdata=data.frame(X=X))
# Ridge Regression
ridge_model <- glmnet(X, y, alpha=0)
ridge_pred <- predict(ridge_model, s=0.01, newx=X)
# Compare Performance
lm_mse <- mean((y - lm_pred)^2)
ridge_mse <- mean((y - ridge_pred)^2)
cat("MSE of Ordinary Linear Regression:", lm_mse, "\n")
cat("MSE of Ridge Regression:", ridge_mse, "\n")
The script generates a dataset with multicollinearity and then compares the mean squared error (MSE) of ordinary linear regression and ridge regression. Typically, you’ll observe that ridge regression provides a better performance in this scenario.
[15 points] Based on what we have learned, do you find this answer correct? Does this code confirm the three properties of ridge regression mentioned in the answer? Please explain your judgement for each of them. Comment on what you would do to demonstrate these properties. You do not need to write the code.
[25 points] To properly demonstrate the bias-variance trade-off, we could consider using a (correct) simulation. Adapt this existing code into a simulation study to show this property. While you are doing this, please consider the following: fit the ridge regression with glmnet(), and use the penalty sequence exp(seq(log(0.5), log(0.01), length.out = 100)) * trainn, where trainn denotes the size of your training data.
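To make the intended structure concrete, here is a minimal sketch of one way such a simulation study could be organized. It is not the required solution: the training size trainn = 100, the testing size testn = 500, the number of repetitions nsim = 100, the reuse of the GPT-4 data-generating mechanism, and passing the suggested sequence directly to the lambda argument of glmnet() are all assumptions made purely for illustration.
# Sketch of a ridge simulation study (illustrative assumptions throughout)
library(glmnet)
set.seed(432)
trainn <- 100   # assumed training size
testn  <- 500   # assumed testing size
p      <- 10
nsim   <- 100   # assumed number of simulation runs
beta   <- runif(p)
# penalty sequence suggested in the problem; using it directly as glmnet's
# lambda argument is an assumption about the intended parameterization
lambda_grid <- exp(seq(log(0.5), log(0.01), length.out = 100)) * trainn
train_err <- matrix(NA, nsim, length(lambda_grid))
test_err  <- matrix(NA, nsim, length(lambda_grid))
for (i in 1:nsim) {
  # regenerate training and testing data each run (GPT-4's mechanism reused)
  Xtrain <- matrix(rnorm(trainn * p), trainn, p)
  Xtrain[, 2] <- Xtrain[, 1] + rnorm(trainn) * 0.01
  ytrain <- Xtrain %*% beta + rnorm(trainn)
  Xtest <- matrix(rnorm(testn * p), testn, p)
  Xtest[, 2] <- Xtest[, 1] + rnorm(testn) * 0.01
  ytest <- Xtest %*% beta + rnorm(testn)
  # fit the whole ridge path at once
  fit <- glmnet(Xtrain, ytrain, alpha = 0, lambda = lambda_grid)
  # training and testing errors for every lambda in the grid
  train_err[i, ] <- colMeans((predict(fit, newx = Xtrain) - as.vector(ytrain))^2)
  test_err[i, ]  <- colMeans((predict(fit, newx = Xtest) - as.vector(ytest))^2)
}
# averaging over runs shows how the errors change with the amount of shrinkage
plot(log(lambda_grid), colMeans(test_err), type = "l",
     xlab = "log(lambda)", ylab = "average MSE")
lines(log(lambda_grid), colMeans(train_err), lty = 2)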
We will use the golub dataset from the multtest package. This dataset contains 3051 genes from 38 tumor mRNA samples from the leukemia microarray study of Golub et al. (1999). This package is not included in R, but is available on Bioconductor. Install the latest version of this package from Bioconductor, and read the documentation of this dataset to understand the data structure of golub and golub.cl.
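If you have not installed packages from Bioconductor before, the following sketch shows one common route (via the BiocManager package, which is an assumption about your setup) and a few quick ways to inspect the data.
# Install multtest from Bioconductor (one common route, via BiocManager)
if (!requireNamespace("BiocManager", quietly = TRUE))
  install.packages("BiocManager")
BiocManager::install("multtest")
library(multtest)
data(golub)       # loads golub, golub.cl, and golub.gnames
dim(golub)        # 3051 genes by 38 samples
table(golub.cl)   # tumor class labels for the 38 samples
?golub            # documentation describing the data structure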
[25 points] We will not use this data for classification (the original problem). Instead, we will do a toy regression example to show how genes are highly correlated and could be used to predict each other. Carry out the following tasks: calculate a test statistic for each gene against the class labels golub.cl using mt.teststat(), with t.equalvar (two sample \(t\) test with equal variance) as the test statistic.
[5 points] Based on your results, do you observe any bias-variance trade-off? If not, can you explain why?
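As a pointer to the relevant interface only, the call below sketches how per-gene test statistics against golub.cl could be computed with mt.teststat(); treat it as an illustration rather than the required solution.
# Two-sample t statistics (equal variance) of each gene against golub.cl
library(multtest)
data(golub)
tstat <- mt.teststat(golub, golub.cl, test = "t.equalvar")
# genes with the largest absolute statistics separate the two classes best
head(order(abs(tstat), decreasing = TRUE))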
Recall that in the previous homework, we had a quadratic function to minimize, for which an analytical solution exists. However, in this example, let's use coordinate descent to solve the problem. To demonstrate this, consider the following simulated dataset, with design matrix \(x\) (without an intercept) and response vector \(y\):
set.seed(432)
n <- 100
x <- matrix(rnorm(n*2), n, 2)
y <- 0.7 * x[, 1] + 0.5 * x[, 2] + rnorm(n)
We will consider a model without the intercept term. In this case, our objective function (of \(\beta_1\) and \(\beta_2\)) for linear regression is the sum of squared residuals, scaled by \(1/n\), which we want to minimize:
\[ f(\beta_1, \beta_2) = \frac{1}{n} \sum_{i=1}^n (y_i - \beta_1 x_{i1} - \beta_2 x_{i2})^2 \]
where \(x_{ij}\) represents the \(j\)th variable of the \(i\)th observation.
[10 points] Write down the objective function in the form of \[ f(\beta_1, \beta_2) = a \beta_1^2 + b \beta_2^2 + c \beta_1 \beta_2 + d \beta_1 + e \beta_2 + f \] by specifying the coefficients \(a\), \(b\), \(c\), \(d\), \(e\), and \(f\), using the simulated data. Calculate them in R, using vector operations rather than for-loops.
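For reference, expanding the squared residuals term by term (a direct consequence of the definition of \(f\) above, offered only as a sanity check for your own derivation) gives
\[ f(\beta_1, \beta_2) = \left(\frac{1}{n}\sum_{i=1}^n x_{i1}^2\right) \beta_1^2 + \left(\frac{1}{n}\sum_{i=1}^n x_{i2}^2\right) \beta_2^2 + \left(\frac{2}{n}\sum_{i=1}^n x_{i1} x_{i2}\right) \beta_1 \beta_2 - \left(\frac{2}{n}\sum_{i=1}^n x_{i1} y_i\right) \beta_1 - \left(\frac{2}{n}\sum_{i=1}^n x_{i2} y_i\right) \beta_2 + \frac{1}{n}\sum_{i=1}^n y_i^2, \]
from which the coefficients can be matched term by term.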
[10 points] A coordinate descent algorithm essentially does two steps:
i. update \(\beta_1\) to its optimal value while keeping \(\beta_2\) fixed;
ii. update \(\beta_2\) to its optimal value while keeping \(\beta_1\) fixed.
Write down the updating rules for \(\beta_1\) and \(\beta_2\) using the coordinate descent algorithm. Use the previously defined coefficients in your formulas and write them in LaTeX. Implement them in a for-loop algorithm in R that iterates at most 100 times. Use the initial values \(\beta_1 = 0\) and \(\beta_2 = 0\). Decide your stopping criterion based on the change in \(\beta_1\) and \(\beta_2\). Validate your solution using the lm() function.
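As a structural illustration only, one possible shape of such a loop is sketched below. It assumes the coefficients from the previous part are stored in variables a, b, c, d, and e, and that each updating rule comes from setting the corresponding partial derivative of the quadratic objective to zero; the tolerance tol is an arbitrary choice.
# Coordinate descent sketch (assumes a, b, c, d, e are already computed;
# here c refers to that numeric coefficient, not the base function c())
beta1 <- 0
beta2 <- 0
tol <- 1e-8                        # assumed stopping tolerance
for (iter in 1:100) {
  beta1_old <- beta1
  beta2_old <- beta2
  # minimize over beta1 with beta2 fixed, then over beta2 with beta1 fixed
  beta1 <- -(c * beta2 + d) / (2 * a)
  beta2 <- -(c * beta1 + e) / (2 * b)
  # stop once neither coordinate moves appreciably
  if (max(abs(beta1 - beta1_old), abs(beta2 - beta2_old)) < tol) break
}
c(beta1, beta2)                    # compare with lm(y ~ x - 1)$coefficients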