Question 1: Another Simulation Study



Here is a GPT-5 prompt I used to create some content:

How does ridge regression deal with correlated variables? Is it always better than Mallows’ C_p for prediction? Can you suggest a code with 20 variables to confirm your answer?

The following is the answer produced by GPT-5

Here’s a concise answer:

Ridge with correlated variables: Ridge regression shrinks coefficients toward zero in a way that distributes the effect across correlated predictors. Instead of picking one variable and ignoring the others, it tends to keep correlated variables together and stabilize their estimates. This reduces variance and avoids the instability of ordinary least squares.

Ridge vs. Mallows’ \(C_p\): Ridge is not always better. Ridge generally performs better when the signal is spread across many correlated predictors (dense signals). Mallows’ \(C_p\), used for subset selection, can perform better when only a few predictors truly matter (sparse signals).

In practice, which is better depends on the structure of the true underlying model. Would you like me to provide a small R code example (with 20 variables) that demonstrates both cases — dense vs. sparse signals — so you can see the difference in prediction performance?
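
Before turning to the full study, here is a minimal standalone sketch (separate from the GPT-5 output above) illustrating the coefficient-sharing behavior it describes, using two strongly correlated predictors; the specific seed, sample size, and lambda value are arbitrary choices for illustration only:

  library(glmnet)
  
  # two nearly identical predictors; only x1 carries signal
  set.seed(1)
  n  <- 100
  x1 <- rnorm(n)
  x2 <- x1 + rnorm(n, sd = 0.05)
  y  <- x1 + rnorm(n)
  
  # OLS: unstable, possibly opposite-signed coefficients on x1 and x2
  coef(lm(y ~ x1 + x2))
  
  # ridge (alpha = 0): shrunken coefficients, with the effect split between x1 and x2
  coef(glmnet(cbind(x1, x2), y, alpha = 0, lambda = 0.5))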



Part a) [15 pts]

I would like to first consider a simulation study to understand the performance of the two methods. While doing this, let’s also compare the bias and variance of the coefficient estimates. Consider the following data generator, with \(p = 20\), but \(y\) depends on only the first two variables:

\[ y = x_1 + x_2 + \epsilon \]

where \(\epsilon \sim N(0,1)\) and \(x_1\) and \(x_2\) are generated from independent standard normal distributions, while the rest of the \(p-2\) variables are generated in the following way:

\[ x_j = \eta_j x_1 + \phi_j x_2 + \xi_j, \quad j = 3, \ldots, p \]

where \(\eta_j, \phi_j \sim \text{Uniform}(-1, 1)\) and \(\xi_j \sim N(0, 0.3^2)\) are all independently generated. Write your code to generate this training data with \(n = 100\), and fit both ridge regression (use leave-one-out CV and lambda.min to pick the tuning parameter) and best subset selection (using Mallows’ \(C_p\)). Generate another \(1000\) testing observations, and evaluate the prediction performance of both methods.

  • Which variable(s) were selected by your Mallows’ \(C_p\) method?
  • Which variable was given the largest coefficient (based on absolute values) by your ridge regression?
  • Which method had the smaller test error?

Answer:
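
For reference, the best subset selection below is scored with Mallows’ \(C_p\). For a candidate model with \(k\) predictors plus an intercept, residual sum of squares \(\text{SSE}_k\), and \(\hat{\sigma}^2\) estimated from the full model,

\[ C_k = \frac{\text{SSE}_k}{\hat{\sigma}^2} - n + 2(k+1) \]

and smaller values are preferred; this matches the cp values reported by summary() of a regsubsets() fit, which the code below relies on.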

  library(MASS)
  library(glmnet)
  library(leaps)
  library(ggplot2)
  set.seed(432)
  
  # generate data
  ntrain = 100
  ntest = 1000
  n <- ntrain + ntest
  p <- 20
  
  X <- matrix(0, n, p)
  X[, 1] <- rnorm(n)
  X[, 2] <- rnorm(n)
  for (j in 3:p) {
    eta_j <- runif(1, -1, 1)
    phi_j <- runif(1, -1, 1)
    xi_j <- rnorm(n, 0, sd = 0.3)
    X[, j] <- eta_j * X[, 1] + phi_j * X[, 2] + xi_j
  }
  colnames(X) <- paste0("x", 1:p)

  beta <- c(1, 1, rep(0, p - 2))

  y <- X %*% beta + rnorm(n)
  data <- data.frame(y = y, X)
  
  traindata <- data[1:ntrain, ]
  testdata <- data[-(1:ntrain), ]
  
  # fit ridge regression with leave-one-out CV, selecting lambda.min
  cv_ridge <- cv.glmnet(as.matrix(traindata[, -1]), traindata$y, alpha = 0, nfolds = nrow(traindata))
  ridge_pred <- predict(cv_ridge, s = "lambda.min", newx = as.matrix(testdata[, -1]))
  ridge_coefs <- as.vector(coef(cv_ridge, s = "lambda.min"))[-1]
  
  # fit best subset selection using Mallows' Cp
  regfit_full <- regsubsets(y ~ ., data = traindata, nvmax = p)
  reg_summary <- summary(regfit_full)
  best_subset <- reg_summary$which[which.min(reg_summary$cp), -1]
  
  # refit the best model
  refit <- lm(y ~ ., data = traindata[, c(TRUE, best_subset)])
  best_pred <- predict(refit, newdata = testdata)
  best_vars <- names(traindata)[-1][best_subset]
  
  # calculate test errors
  ridge_mse <- mean((testdata$y - ridge_pred)^2)
  best_mse <- mean((testdata$y - best_pred)^2)
  which.max(abs(ridge_coefs))
## [1] 10
  best_vars
## [1] "x7"  "x10" "x16" "x17"
  ridge_mse
## [1] 1.107624
  best_mse
## [1] 1.187038

In this run, the variables selected by Mallows’ \(C_p\) are x7, x10, x16, and x17, so neither of the two true signal variables (x1 and x2) was selected. The variable with the largest coefficient (in absolute value) in the ridge regression is x10, with coefficient -0.363. The test error for ridge regression is 1.108, while the test error for best subset selection is 1.187, so ridge regression has the smaller test error.
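
As a quick follow-up check (a sketch using the traindata object from the chunk above; the particular variables are specific to this seed), we can look at how strongly the selected surrogates track the true signal variables:

  # correlation of the Cp-selected variables with the true signals x1 and x2
  round(cor(traindata[, c("x7", "x10", "x16", "x17")],
            traindata[, c("x1", "x2")]), 2)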

Part b) [25 pts]

Now, let’s consider a full simulation study. For this question, we will repeat the above procedure 200 times, record the prediction errors, and also evaluate the bias and variance of each coefficient estimate. To do this, we need to record the coefficient estimates for all variables for both methods and store them. For the Mallows’ \(C_p\) method, if a variable is not selected, then we record its coefficient as 0. After all simulations, you should have a \(200 \times 20\) matrix for each method. The intercept is not needed. Based on this matrix, you can calculate the bias and variance of the coefficient estimates for each variable by

  • Comparing the mean of these estimates to the true \(\beta\) values
  • Calculating the variance of these estimates over the 200 simulation runs

Answer the following question:

  • Provide a barplot to visualize and compare the parameter estimates for both methods.
  • Looking at the bias result overall, what patterns do you see?
  • Looking at the variance result overall, what patterns do you see?
  • Which method has a smaller testing error?

Answer:

  B <- 200
  ridge_coefs_mat <- matrix(0, nrow = B, ncol = p)
  best_coefs_mat <- matrix(0, nrow = B, ncol = p)
  ridge_mse_vec <- numeric(B)
  best_mse_vec <- numeric(B)
  
  for (b in 1:B) {
    # generate data
    X <- matrix(0, n, p)
    X[, 1] <- rnorm(n)
    X[, 2] <- rnorm(n)
    for (j in 3:p) {
      eta_j <- runif(1, -1, 1)
      phi_j <- runif(1, -1, 1)
      xi_j <- rnorm(n, 0, sd = 0.3)
      X[, j] <- eta_j * X[, 1] + phi_j * X[, 2] + xi_j
    }
    colnames(X) <- paste0("x", 1:p)
  
    beta <- c(1, 1, rep(0, p - 2))
  
    y <- X %*% beta + rnorm(n)
    data <- data.frame(y = y, X)
    
    traindata <- data[1:ntrain, ]
    testdata <- data[-(1:ntrain), ]
    
    # fit ridge regression
    cv_ridge <- cv.glmnet(as.matrix(traindata[, -1]), traindata$y, alpha = 0, nfolds = nrow(traindata))
    ridge_pred <- predict(cv_ridge, s = "lambda.min", newx = as.matrix(testdata[, -1]))
    ridge_coefs_mat[b, ] <- as.vector(coef(cv_ridge, s = "lambda.min"))[-1]
    
    # fit best subset selection using Mallows' Cp
    regfit_full <- regsubsets(y ~ ., data = traindata, nvmax = p)
    reg_summary <- summary(regfit_full)
    best_subset <- reg_summary$which[which.min(reg_summary$cp), -1]
    
    # refit the best model
    refit <- lm(y ~ ., data = traindata[, c(TRUE, best_subset)])
    best_pred <- predict(refit, newdata = testdata)
    best_coefs_mat[b, best_subset] <- coef(refit)[-1]
    
    # calculate test errors
    ridge_mse_vec[b] <- mean((testdata$y - ridge_pred)^2)
    best_mse_vec[b] <- mean((testdata$y - best_pred)^2)
  }
  
  # calculate bias and variance
  true_beta <- beta
  ridge_bias <- colMeans(ridge_coefs_mat) - true_beta
  ridge_var <- apply(ridge_coefs_mat, 2, var)
  
  best_bias <- colMeans(best_coefs_mat) - true_beta
  best_var <- apply(best_coefs_mat, 2, var)
  
  ridge_mse_mean <- mean(ridge_mse_vec)
  best_mse_mean <- mean(best_mse_vec)
  
  # produce boxplots of ridge_coefs_mat and best_coefs_mat
  par(mfrow = c(1, 2))
  boxplot(ridge_coefs_mat, main = "Ridge Coefficients", xlab = "Variables", ylab = "Coefficient Estimates")
  boxplot(best_coefs_mat, main = "Best Subset Coefficients", xlab = "Variables", ylab = "Coefficient Estimates")

  # plot the bias and variance
  plot(1:20, ridge_bias, type = "b", col = "blue", pch = 19, xlab = "Variables", ylab = "Bias", 
       main = "Bias Comparison", ylim = c(min(ridge_bias, best_bias), max(ridge_bias, best_bias)))
  points(1:20, best_bias, type = "b", col = "red", pch = 17)
  legend("bottomright", legend = c("Ridge", "Best Subset"), col = c("blue", "red"), pch = c(19, 17))
  
  plot(1:20, ridge_var, type = "b", col = "blue", pch = 19, xlab = "Variables", ylab = "Variance", 
       main = "Variance Comparison", ylim = c(0, max(ridge_var, best_var)))
  points(1:20, best_var, type = "b", col = "red", pch = 17)
  legend("topright", legend = c("Ridge", "Best Subset"), col = c("blue", "red"), pch = c(19, 17))

  ridge_mse_mean
## [1] 1.085818
  best_mse_mean
## [1] 1.180903

My results show that the average test error for ridge regression is 1.086, while the average test error for \(C_p\) selection is 1.181, so ridge regression has the smaller average test error. However, the bias of the ridge estimates is large compared to \(C_p\) selection, especially for the first two (true signal) variables; for the other variables, the bias of \(C_p\) is almost 0. On the other hand, the variance of the ridge estimates is much smaller than that of \(C_p\) selection, again most notably for the first two variables.
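
To put the two components on a common scale, we can also combine them into an estimated mean squared error for each coefficient (a short sketch using the bias and variance vectors computed above; output not shown):

  # per-coefficient MSE estimate: squared bias plus variance
  ridge_coef_mse <- ridge_bias^2 + ridge_var
  best_coef_mse  <- best_bias^2 + best_var
  round(rbind(ridge = ridge_coef_mse, best_subset = best_coef_mse), 3)
  
  # totals over all 20 coefficients
  c(ridge = sum(ridge_coef_mse), best_subset = sum(best_coef_mse))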

Part c) [10 pts]

Based on your results (b)

  • Can you now compare your results to the answer provided by GPT-5? Do you agree or disagree with the statements made there? Please provide your reasoning.
  • Based on these simulation results, can you comment on the bias-variance trade-off for these two methods?

Answer:

The statements made by GPT-5 are generally accurate. Ridge regression does tend to shrink coefficients and distribute the effect across correlated predictors, which helps stabilize the estimates and reduce their variance. In contrast, best subset selection (using Mallows’ \(C_p\)) can produce much more variable estimates when the predictors are highly correlated, since it may select a different subset of surrogate variables in each simulation run. However, GPT-5’s claim that \(C_p\) tends to win for sparse signals needs qualification: in our simulation the true signal is sparse (only \(x_1\) and \(x_2\) matter), yet ridge regression still achieved the smaller test error because the predictors are so strongly correlated that subset selection could not reliably identify the true variables.

In terms of the bias-variance trade-off, ridge regression has larger bias but smaller variance, while best subset selection has smaller bias but larger variance. Overall, the trade-off favors ridge regression in this case, leading to a smaller test error.
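
For reference, the quantities computed in Part b) estimate the two pieces of the decomposition

\[ \mathbb{E}\big[(\hat{\beta}_j - \beta_j)^2\big] = \big(\mathbb{E}[\hat{\beta}_j] - \beta_j\big)^2 + \text{Var}(\hat{\beta}_j) = \text{Bias}^2 + \text{Variance} \]

so a method can tolerate extra bias as long as it buys a sufficiently large reduction in variance, which is exactly what happens for ridge regression in this simulation.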

Question 2: Modeling High-Dimensional Data

We will use the golub dataset from the multtest package. This dataset contains 3051 genes from 38 tumor mRNA samples from the leukemia microarray study Golub et al. (1999). This package is not included in R but is available on Bioconductor. Install the latest version of this package from Bioconductor, and read the documentation of this dataset to understand the data structure of golub and golub.cl.

Part a) [25 points]

We will not use this data for classification (the original problem). Instead, we will do a toy regression example to show how genes are highly correlated and could be used to predict each other. Carry out the following tasks:

  • Perform a marginal association test for each gene with the response golub.cl using mt.teststat(). Use t.equalvar (two-sample \(t\) test with equal variance) as the test statistic.
  • Sort the genes by their p-values and select the top 100 genes
  • Construct a dataset with the top 10 genes and another one (call it \(X\)) with the remaining 90 genes
  • Perform principal component analysis (PCA) on the top 10 genes and extract the first principal component, use this as the outcome \(y\). Be careful about the orientation of the data matrix.
  • Perform ridge regression with 19-fold cross-validation on \(X\) and the outcome \(y\). Does your model fit well? Can you provide detailed model fitting results to support your claim?
  • Fit ridge regression but use GCV as the criterion. Does your model fit well?

Answer:

We first install the multtest package from Bioconductor (for example, with BiocManager::install("multtest")) and load it along with the data.

We can then perform the required tasks. First, we use cv.glmnet() to perform ridge regression with 19-fold cross-validation.

  library(glmnet)
  library(multtest)
  data(golub)
  # perform the marginal association test for each gene with the response golub.cl;
  # note that mt.teststat() returns the t statistics themselves, not p-values
  tstat <- mt.teststat(golub, golub.cl, test="t.equalvar")

  # rank the genes; sorting the signed statistics puts the most negative
  # (highly significant) genes first, and this ordering is used as the ranking here
  # (a ranking strictly by p-value would order by abs(tstat) instead)
  ord <- sort(tstat, index.return=TRUE)
  
  # select the top 10 genes
  top10genes <- ord$ix[1:10]
  
  # select the next 90 genes (these will form X)
  next90genes <- ord$ix[11:100]
  
  # construct the dataset with the top 10 genes and the next 90 genes
  y0 <- t(golub[top10genes, ])
  X <- t(golub[next90genes, ])

  # perform PCA on the top 10 genes and extract the first principal component
  y <- prcomp(y0)$x[, 1]

  # perform ridge regression with 19-fold cross-validation
  cvfit <- cv.glmnet(X, y, alpha=0, nfolds=19)
## Warning: Option grouped=FALSE enforced in cv.glmnet, since < 3 observations per
## fold
  cvfit$lambda.min
## [1] 16.06471
  plot(cvfit)
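
The chunk above selects lambda.min by 19-fold cross-validation. To give some more detailed fitting results to support the claim about model fit, here is a minimal sketch (using the cvfit, X, and y objects above; cv_r2 is only a rough cross-validated \(R^2\)):

  # cross-validated MSE at lambda.min relative to the variance of y,
  # plus the in-sample R^2 of the ridge fit at lambda.min
  cv_mse <- min(cvfit$cvm)
  cv_r2  <- 1 - cv_mse / var(y)
  yhat   <- predict(cvfit, newx = X, s = "lambda.min")
  train_r2 <- 1 - mean((y - yhat)^2) / mean((y - mean(y))^2)
  round(c(cv_mse = cv_mse, cv_r2 = cv_r2, train_r2 = train_r2), 3)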

We print the lambda.min value above; this is the value of \(\lambda\) that gives the smallest cross-validation error. We could also have chosen lambda.1se if we wanted to introduce more shrinkage (bias). The plot above shows the mean squared error versus \(\lambda\). Next, we fit ridge regression using the lm.ridge() function and use GCV as the selection criterion.

  # fit ridge regression using lm.ridge() and use GCV as the selection criterion
  library(MASS)
  lambda.seq <- seq(0.01, 2*cvfit$lambda.1se, length.out = 100)
  fit <- lm.ridge(y ~ X, lambda = lambda.seq)
  plot(lambda.seq, fit$GCV)
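
For reference, the generalized cross-validation criterion for ridge regression with hat matrix \(H_\lambda = X(X^\top X + \lambda I)^{-1} X^\top\) is, up to how lm.ridge() handles centering and scaling,

\[ \text{GCV}(\lambda) = \frac{\frac{1}{n}\sum_{i=1}^n (y_i - \hat{y}_{i,\lambda})^2}{\big(1 - \text{tr}(H_\lambda)/n\big)^2} \]

so it behaves like an in-sample error inflated by the effective degrees of freedom \(\text{tr}(H_\lambda)\): as \(\lambda\) grows, the residuals grow while the correction term shrinks.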

GCV increases as \(\lambda\) increases, indicating that introducing additional bias only hurts the fit. This is because \(X\) contains the 11th-100th ranked genes, which are highly correlated with the top 10 genes used to construct \(y\), so an essentially unpenalized fit already predicts \(y\) very well.

Part b) [5 points]

Based on your results, do you observe a “U” shaped curve? Can you comment on how this is related to the bias-variance trade-off?

Answer:

We do not observe a U-shaped curve; the error increases monotonically as the penalty increases, so the smallest penalty gives the best fit. The outcome \(y\) is the first principal component of the top 10 genes, and the 90 predictors in \(X\) (the 11th-100th ranked genes) are highly correlated with those top genes, so a model that uses all of them with essentially no shrinkage predicts \(y\) well. Such a model has minimal bias, and the extra bias introduced by a larger \(\lambda\) is not offset by a meaningful reduction in variance, which is why a very small \(\lambda\) is optimal.
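
A quick check supporting the correlation argument (a sketch using the X and y objects from Part a); it summarizes the absolute correlation between each of the 90 predictors and the PCA-based outcome:

  # absolute correlation between each column of X and the outcome y
  summary(as.vector(abs(cor(X, y))))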

Question 3: Linear Regression with Coordinate Descent

Recall that in the previous homework, we had a quadratic function for minimization, and we know that an analytic solution exists. However, in this example, let’s use coordinate descent to solve the problem. To demonstrate this, let’s consider the following simulated dataset, with design matrix \(x\) (without intercept) and response vector \(y\):

  set.seed(432)
  n <- 150
  x <- matrix(rnorm(n*2), n, 2)
  y <- 0.3 * x[, 1] + 0.5 * x[, 2] + rnorm(n)

We will consider a model without the intercept term. In this case, our objective function (of \(\beta_1\) and \(\beta_2\)) for linear regression is the averaged sum of squared residuals, which we want to minimize:

\[ f(\beta_1, \beta_2) = \frac{1}{n} \sum_{i=1}^n (y_i - \beta_1 x_{i1} - \beta_2 x_{i2})^2 \]

where \(x_{ij}\) represents the \(j\)th variable of the \(i\)th observation.

Part a) [10 points]

Write down the objective function in the form of

\[ f(\beta_1, \beta_2) = a \beta_1^2 + b \beta_2^2 + c \beta_1 \beta_2 + d \beta_1 + e \beta_2 + f \]

by specifying what the coefficients a, b, c, d, e, and f are, using the simulated data. Calculate them in R using vector operations rather than for-loops.

Answer:

The objective function can be written as:

\[ f(\beta_1, \beta_2) = \frac{1}{n} \sum_{i=1}^n (y_i - \beta_1 x_{i1} - \beta_2 x_{i2})^2 \]

Expanding the squared term, we get:

\[ f(\beta_1, \beta_2) = \frac{1}{n} \left(\sum_{i=1}^n y_i^2 - 2 \beta_1 \sum_{i=1}^n x_{i1} y_i - 2 \beta_2 \sum_{i=1}^n x_{i2} y_i + \beta_1^2 \sum_{i=1}^n x_{i1}^2 + 2 \beta_1 \beta_2 \sum_{i=1}^n x_{i1} x_{i2} + \beta_2^2 \sum_{i=1}^n x_{i2}^2\right) \]

Let \(a = \sum_{i=1}^n x_{i1}^2\), \(b = \sum_{i=1}^n x_{i2}^2\), \(c = \sum_{i=1}^n x_{i1} x_{i2}\), \(d = \sum_{i=1}^n x_{i1} y_i\), \(e = \sum_{i=1}^n x_{i2} y_i\), and \(f = \sum_{i=1}^n y_i^2\). Then, the objective function can be written as:

\[ f(\beta_1, \beta_2) = \frac{1}{n} (a \beta_1^2 + b \beta_2^2 + 2 c \beta_1 \beta_2 - 2d \beta_1 - 2e \beta_2 + f) \] With these definitions, the overall factor \(1/n\) and the explicit factors of 2 are simply absorbed into the generic coefficients of the form given in the question. To calculate these coefficients, we can use the simulated data:

  a = sum(x[, 1]^2)
  b = sum(x[, 2]^2)
  c = sum(x[, 1] * x[, 2])
  d = sum(x[, 1] * y)
  e = sum(x[, 2] * y)
  f = sum(y^2)
  
  cat("a =", a, "\n")
## a = 131.0372
  cat("b =", b, "\n")
## b = 126.2778
  cat("c =", c, "\n")
## c = 7.553435
  cat("d =", d, "\n")
## d = 34.90129
  cat("e =", e, "\n")
## e = 74.76299
  cat("f =", f, "\n")
## f = 172.4326
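
As a quick sanity check (a sketch using the coefficients just computed together with the OLS fit from lm()), the quadratic form should reproduce the original objective at any \((\beta_1, \beta_2)\), for example at the least-squares solution:

  # evaluate the objective directly and through the quadratic expansion;
  # the two numbers should agree up to floating-point error
  beta_hat <- coef(lm(y ~ x - 1))
  obj_direct <- mean((y - x %*% beta_hat)^2)
  obj_quad <- (a * beta_hat[1]^2 + b * beta_hat[2]^2 + 2 * c * beta_hat[1] * beta_hat[2] -
               2 * d * beta_hat[1] - 2 * e * beta_hat[2] + f) / n
  cat(obj_direct, obj_quad, "\n")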

Part b) [10 points]

A coordinate descent algorithm essentially does two steps:

  1. Update \(\beta_1\) to its optimal value while keeping \(\beta_2\) fixed
  2. Update \(\beta_2\) to its optimal value while keeping \(\beta_1\) fixed

Write down the updating rules for \(\beta_1\) and \(\beta_2\) using the coordinate descent algorithm. Use those previously defined coefficients in your formula and write them in LaTeX. Implement them in a for-loop algorithm in R that iterates at most 100 times. Use the initial values \(\beta_1 = 0\) and \(\beta_2 = 0\). Decide your stopping criterion based on the change in \(\beta_1\) and \(\beta_2\). Validate your solution using the lm() function.

Answer:

The updating rules for \(\beta_1\) and \(\beta_2\) using the coordinate descent algorithm are as follows:

  1. Update \(\beta_1\) to its optimal value while keeping \(\beta_2\) fixed:

\[ \beta_1 = \frac{d - c \beta_2}{a} \]

  2. Update \(\beta_2\) to its optimal value while keeping \(\beta_1\) fixed:

\[ \beta_2 = \frac{e - c \beta_1}{b} \]
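
These updates follow from setting the corresponding partial derivative of the objective to zero; for example, for \(\beta_1\),

\[ \frac{\partial f}{\partial \beta_1} = \frac{1}{n}\big(2 a \beta_1 + 2 c \beta_2 - 2 d\big) = 0 \quad \Longrightarrow \quad \beta_1 = \frac{d - c \beta_2}{a} \]

and symmetrically for \(\beta_2\).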

The following R code implements the coordinate descent algorithm:

  beta1 = 0
  beta2 = 0
  max_iter = 100
  tol = 1e-6
  
  for (i in 1:max_iter)
  {
    # note: a strict coordinate descent sweep would use beta1_new when
    # updating beta2; this Jacobi-style variant updates both coordinates
    # from the previous iterate and converges to the same solution here
    beta1_new = (d - c * beta2) / a
    beta2_new = (e - c * beta1) / b
    
    # stop once both coordinates change by less than the tolerance
    if (abs(beta1_new - beta1) < tol & abs(beta2_new - beta2) < tol)
    {
      break
    }
    
    beta1 = beta1_new
    beta2 = beta2_new
  }
  
  cat("beta1 =", beta1, "\n")
## beta1 = 0.2330224
  cat("beta2 =", beta2, "\n")
## beta2 = 0.5781132
  summary(lm(y ~ x - 1))
## 
## Call:
## lm(formula = y ~ x - 1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.74885 -0.68575 -0.09978  0.52779  2.43067 
## 
## Coefficients:
##    Estimate Std. Error t value Pr(>|t|)    
## x1  0.23302    0.07915   2.944  0.00376 ** 
## x2  0.57811    0.08063   7.170 3.32e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9045 on 148 degrees of freedom
## Multiple R-squared:  0.2978, Adjusted R-squared:  0.2883 
## F-statistic: 31.39 on 2 and 148 DF,  p-value: 4.336e-12