Instruction

Please remove this section when submitting your homework.

Students are encouraged to work together on homework and/or utilize advanced AI tools. However, sharing, copying, or providing any part of a homework solution or code to others is an infraction of the University’s rules on Academic Integrity. Any violation will be punished as severely as possible. Final submissions must be uploaded to Gradescope. No email or hard copy will be accepted. For late submission policy and grading rubrics, please refer to the course website.

Question 1: Another Simulation Study

Here is a GPT-4 prompt I used:

What is the advantage of a Ridge regression? Can you suggest a simulation study to confirm that? Please write your code in R.

The following is the answer produced by GPT-4:



Ridge regression is a regularization technique that helps to solve the problem of multicollinearity in linear regression. It can handle high-dimensional data where the number of predictors may exceed the number of observations. The advantages include:

You can simulate the advantage of Ridge regression by comparing its performance with ordinary linear regression on a dataset with multicollinearity. Here’s an R script to conduct this study:

    # Load required library
    library(MASS)
    library(glmnet)
    
    # Simulate data with multicollinearity
    set.seed(42)
    n <- 100
    p <- 10
    X <- matrix(rnorm(n * p), n, p)
    X[,2] <- X[,1] + rnorm(n) * 0.01 # Introducing multicollinearity
    beta <- runif(p)
    y <- X %*% beta + rnorm(n)
    
    # Ordinary Linear Regression
    lm_model <- lm(y ~ X)
    lm_pred <- predict(lm_model, newdata=data.frame(X=X))
    
    # Ridge Regression
    ridge_model <- glmnet(X, y, alpha=0)
    ridge_pred <- predict(ridge_model, s=0.01, newx=X)
    
    # Compare Performance
    lm_mse <- mean((y - lm_pred)^2)
    ridge_mse <- mean((y - ridge_pred)^2)
    
    cat("MSE of Ordinary Linear Regression:", lm_mse, "\n")
    cat("MSE of Ridge Regression:", ridge_mse, "\n")

The script generates a dataset with multicollinearity and then compares the mean squared error (MSE) of ordinary linear regression and ridge regression. Typically, you’ll observe that ridge regression provides a better performance in this scenario.



  1. [20 points] Based on what we have learned, do you find this answer correct? Does this code confirm the three mentioned properties of Ridge regression? Please explain your judgment for each of them. Comment on how you would demonstrate these properties. You do not need to write the code.

    • Bias-Variance Trade-off
    • Stability
    • Including All Predictors
  2. [45 points] To properly demonstrate the bias-variance trade-off, we could consider using a (correct) simulation. Adapt the existing code into a simulation study (repeated 200 times) to show this property. While doing this, please consider the following:

    • You can borrow similar ideas of simulation we used in previous lecture notes
    • Modify the GPT-4 code with the following settings to generate the data:
      • Training sample size \(trainn = 50\)
      • Testing sample size \(testn = 200\)
      • \(p = 200\)
      • Fix \(b = rep(0.1, p)\) for all simulation runs
    • Since linear regression doesn’t work in this setting, you only need to consider glmnet()
    • Use a set of \(\lambda\) values exp(seq(log(0.5), log(0.01), length.out = 100))*trainn
    • Instead of evaluating the bias and variance separately (we will do that in the future), we will use the testing error as the metric.
    • Demonstrate your result using plots and give a clear explanation of your findings. Particularly, which side of the result displays a large bias, and which side corresponds to a large variance?
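
The settings listed above can be put together roughly as follows. This is only a minimal sketch under stated assumptions, not a reference solution: the helper `gen_data()`, the object names `lambda_all`, `test_err`, and `avg_err`, and the seed are hypothetical, the near-duplicate second column is simply carried over from the GPT-4 script, and a fresh test set is drawn in every run.

```r
library(glmnet)

set.seed(1)            # arbitrary seed, chosen only for reproducibility
trainn <- 50           # training sample size
testn  <- 200          # testing sample size
p      <- 200          # number of predictors
b      <- rep(0.1, p)  # coefficients, fixed across all simulation runs
nsim   <- 200          # number of simulation runs

# Decreasing grid of penalty values, as specified in the question
lambda_all <- exp(seq(log(0.5), log(0.01), length.out = 100)) * trainn

# Hypothetical helper: one data set following the GPT-4 design
gen_data <- function(n, p, b) {
  X <- matrix(rnorm(n * p), n, p)
  X[, 2] <- X[, 1] + rnorm(n) * 0.01  # multicollinearity kept from the original script
  y <- drop(X %*% b) + rnorm(n)
  list(X = X, y = y)
}

# One row per simulation run, one column per lambda value
test_err <- matrix(NA_real_, nsim, length(lambda_all))

for (i in seq_len(nsim)) {
  train <- gen_data(trainn, p, b)
  test  <- gen_data(testn, p, b)

  # Ridge regression (alpha = 0) fitted over the whole lambda grid at once
  fit  <- glmnet(train$X, train$y, alpha = 0, lambda = lambda_all)
  pred <- predict(fit, newx = test$X, s = lambda_all)  # testn x 100 matrix

  # Testing mean squared error for each lambda
  test_err[i, ] <- colMeans((test$y - pred)^2)
}

# Average the testing error over runs and plot it against log(lambda)
avg_err <- colMeans(test_err)
plot(log(lambda_all), avg_err, type = "l",
     xlab = "log(lambda)", ylab = "Average testing error")
```

Averaging over the 200 runs smooths out the randomness of any single training set, so the shape of the resulting curve, rather than one noisy realization, is what the explanation should be based on.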

Question 2: Modeling High-Dimensional Data

We will use the golub dataset from the multtest package. This dataset contains 3051 genes from 38 tumor mRNA samples from the leukemia microarray study of Golub et al. (1999). This package is not included with R but is available on Bioconductor. Install the latest version of this package from Bioconductor, and read the documentation of this dataset to understand the data structure of golub and golub.cl.
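
One possible way to install the package and take a first look at the data is sketched below. It assumes the BiocManager package is used as the installer and that a CRAN mirror is reachable; the mirror URL is just one common choice.

```r
# Install multtest from Bioconductor only if it is not already available
if (!requireNamespace("multtest", quietly = TRUE)) {
  if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager", repos = "https://cloud.r-project.org")
  }
  BiocManager::install("multtest")
}

library(multtest)
data(golub)       # loads golub and golub.cl, among other objects

dim(golub)        # genes in rows, samples in columns
table(golub.cl)   # class label of each sample

# The data structure is documented in help("golub")
```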

[30 points] We will not use this data for classification (the original problem). Instead, we will do a toy regression example to show how genes are highly correlated and could be used to predict each other. Carry out the following tasks:

[5 points] Based on your results, do you observe any bias-variance trade-off? If not, can you explain why?