Question 1: Lasso Simulation [50 pts]

During our lecture, we considered a simulation model to analyze the variable selection property of Lasso. Now let’s further investigate the prediction error caused by the \(L_1\) penalty under this model, and understand the bias-variance trade-off. For this question, your underlying true data generating model should be

\[\begin{align} Y &= X^\text{T} \boldsymbol \beta + \epsilon \\ &= \sum_{j = 1}^p 0.4^{\sqrt{j}} X_j + \epsilon, \end{align}\]

where \(p = 30\), each \(X_j\) is generated independently from \({\cal N}(0, 1)\), and \(\epsilon\) also follows a standard normal distribution, independent of \(X\). The goal is to predict two target points and investigate how the prediction error changes under different penalties. The two target testing points are defined by the following code.

    # target testing points
    p = 30
    xa = xb = rep(0, p)
    xa[2] = 1
    xb[10] = 1
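
The training data are not shown above; they are generated inside the simulation loop in the answer below. For reference, a minimal sketch of one training set drawn from this model (assuming \(n = 100\), matching the answer code):

    # one simulated training set under the model above
    # (n = 100 follows the answer code; each X_j ~ N(0, 1), noise is standard normal)
    n = 100
    coefs = 0.4^sqrt(1:p)            # true coefficients beta_j = 0.4^sqrt(j)
    x = matrix(rnorm(n * p), n, p)   # design matrix
    y = x %*% coefs + rnorm(n)       # responses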

Answer the following questions:

Answer:

    set.seed(1)
    library(glmnet)
    n = 100
    p = 30
    all_lambda = exp(seq(-5, 5, 0.05))
    nsim = 200
    coefs = 0.4^sqrt(c(1:p))

    # testing data
    xtest = rbind(xa, xb)

    # store the errors
    errors = array(NA, dim = c(2, length(all_lambda), nsim))
    
    # repeat the simulation
    for (i in 1:nsim){
        # the design matrix
        x = matrix(rnorm(n*p), n, p)
        y = x %*% coefs + rnorm(n)
        
        # noiseless outcomes at the two target points
        ytest = as.vector(xtest %*% coefs)
        
        # fit the lasso along the full lambda grid and record squared prediction errors
        glmnetfit = glmnet(x, y, lambda = all_lambda)
        errors[,,i] = (predict(glmnetfit, xtest, s = all_lambda) - ytest)^2
    }

    # average the errors over all simulation runs
    mean_errors = apply(errors, c(1, 2), mean)
    xa_error = mean_errors[1, ]
    xb_error = mean_errors[2, ]

    # plot the errors
    plot(log(all_lambda), mean_errors[1, ], type = "l", xlab = "log(lambda)", 
         ylab = "mean squared prediction error", ylim = c(0, 0.1))
    points(log(all_lambda), mean_errors[2, ], col = "red", type = "l")
    legend("topleft", legend = c("xa", "xb"), col = c("black", "red"), lty = 1)

    # optimal lambda and error
    lambda_a = all_lambda[which.min(xa_error)]
    best_err_a = min(xa_error)
    lambda_b = all_lambda[which.min(xb_error)]
    best_err_b = min(xb_error)
    lambda_err = matrix(c(lambda_a, lambda_b, best_err_a, best_err_b), 2, 2)
    rownames(lambda_err) = c("xa", "xb")
    colnames(lambda_err) = c("optimal lambda", "optimal error")
    knitr::kable(data.frame(lambda_err), table.attr = "style='width:40%;'",
                 format = "html", digits = 5)
     optimal.lambda   optimal.error
xa          0.02872         0.02193
xb          0.10026         0.01161

The prediction at xa relies heavily on the estimate of a relatively large parameter (the second coefficient, \(0.4^{\sqrt{2}}\)), so even a small amount of penalty introduces a noticeable bias. In contrast, the prediction at xb depends only on a small parameter (\(0.4^{\sqrt{10}}\)), and even a large penalty simply shrinks its estimate toward or to 0. For xb, a larger penalty therefore mainly reduces variance without adding much bias. This is also visible in the figure: the maximum error for xa is much larger than for xb, because a heavy penalty introduces a lot of bias for xa. Hence it is preferable to use a larger \(\lambda\) when predicting xb and a smaller \(\lambda\) for xa, to better balance the bias-variance trade-off.
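
A quick check of the two true coefficients involved makes the contrast concrete (an illustration only, not part of the graded answer):

    # true coefficients that the two target points load on
    0.4^sqrt(2)    # beta_2, used by xa  (about 0.27)
    0.4^sqrt(10)   # beta_10, used by xb (about 0.055)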

Question 2: Lasso with Correlated Variables

A key challenge in using Lasso regression is its potential difficulty in accurately selecting variables when predictors are highly correlated. This assignment is a simulation study designed to investigate the impact of variable correlation on the performance of the Lasso algorithm. Consider the linear model defined as:

\[ Y = X^\text{T} \boldsymbol \beta + \epsilon \]

where \(\boldsymbol \beta = (\beta_1, \beta_2, \ldots, \beta_{30})^\text{T}\) with \(\beta_1 = \beta_{11} = \beta_{21} = 0.4\) and all other \(\beta\) parameters set to zero. The \(p\)-dimensional covariate \(X\) follows a multivariate Gaussian distribution:

\[ \mathbf{X} \sim {\cal N}(\mathbf{0}, \Sigma_{p\times p}). \]

In \(\Sigma\), all diagonal elements are 1, and all off-diagonal elements are \(\rho\).

a) Single Simulation Run [15 pts]

  • Generate 300 training and 100 testing samples independently based on the above model.
  • Use \(\rho = 0.1\).
  • Fit a Lasso model using cv.glmnet() on the training data with 10-fold cross-validation. Use lambda.1se to select the optimal \(\lambda\).
  • Report:
    • Prediction error (MSE) on the test data using lambda.1se.
    • Whether the correct model was selected (i.e., whether the nonzero variables are correctly identified and zero variables are correctly excluded). You may refer to HW3 for this property.

b) Multiple Runs [15 pts]

  • Perform 100 simulation runs as in part a).
  • For each run, record the prediction error and the oracle status of the selected variables.
  • Report the average prediction error and the proportion of runs in which the correct model was selected.

c) Impact of Increased Correlation [10 pts]

  • Repeat task b) with \(\rho = 0.5\).
  • Report the average prediction error and the proportion of oracle estimations.
  • Discuss the reasons behind any observed changes in the proportion of oracle estimations when \(\rho\) changes from 0.1 to 0.5.

Answer:

  # Load required packages
  library(glmnet)
  library(MASS)

  # Set random seed for reproducibility
  set.seed(432)

  # Initialize parameters
  p = 30
  beta = rep(0, p)
  beta[c(1, 11, 21)] = 0.4
  rho = 0.1
  ntrain = 300
  ntest = 100

  # Generate covariance matrix Sigma
  Sigma = matrix(rho, p, p)
  diag(Sigma) = 1

  # Generate training and testing data
  xtrain = mvrnorm(ntrain, rep(0, p), Sigma)
  xtest = mvrnorm(ntest, rep(0, p), Sigma)
  ytrain = xtrain %*% beta + rnorm(ntrain)
  ytest = xtest %*% beta + rnorm(ntest)

  # Fit Lasso model using 10-fold CV
  glmnetfit = cv.glmnet(xtrain, ytrain, nfolds = 10)

  # Make predictions and calculate test error
  ypred = predict(glmnetfit, xtest, s = "lambda.1se")
  test_err = mean((ytest - ypred)^2)

  # Check if correct model was selected
  beta_hat = as.vector(coef(glmnetfit, s = "lambda.1se"))[-1]
  is_correct = all((beta_hat != 0) == (beta != 0))
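
  # Report part a) results: test MSE under lambda.1se and whether the exact
  # support was recovered (printed values depend on the seed above)
  test_err
  is_correct
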
  # Initialize arrays for storing simulation results
  nsim = 100
  errors_all = rep(0, nsim)
  correct_all = rep(FALSE, nsim)

  # Run multiple simulations with rho = 0.1
  for (i in 1:nsim) {
      # Generate new training and testing data
      xtrain = mvrnorm(ntrain, rep(0, p), Sigma)
      xtest = mvrnorm(ntest, rep(0, p), Sigma)
      ytrain = xtrain %*% beta + rnorm(ntrain)
      ytest = xtest %*% beta + rnorm(ntest)
      
      # Fit Lasso model and predict
      glmnetfit = cv.glmnet(xtrain, ytrain, nfolds = 10)
      ypred = predict(glmnetfit, xtest, s = "lambda.1se")
      errors_all[i] = mean((ytest - ypred)^2)
      
      # Check oracle property
      beta_hat = as.vector(coef(glmnetfit, s = "lambda.1se"))[-1]
      correct_all[i] = all((beta_hat != 0) == (beta != 0))
  }

  # Report simulation results for rho = 0.1
  mean(errors_all)
## [1] 1.143557
  mean(correct_all)
## [1] 0.76
  # Update Sigma for rho = 0.5
  rho = 0.5
  Sigma = matrix(rho, p, p)
  diag(Sigma) = 1

  # Run multiple simulations with rho = 0.5 (same loop structure as above)
  for (i in 1:nsim) {
      xtrain = mvrnorm(ntrain, rep(0, p), Sigma)
      xtest = mvrnorm(ntest, rep(0, p), Sigma)
      ytrain = xtrain %*% beta + rnorm(ntrain)
      ytest = xtest %*% beta + rnorm(ntest)
      
      glmnetfit = cv.glmnet(xtrain, ytrain, nfolds = 10)
      ypred = predict(glmnetfit, xtest, s = "lambda.1se")
      errors_all[i] = mean((ytest - ypred)^2)
      
      beta_hat = as.vector(coef(glmnetfit, s = "lambda.1se"))[-1]
      correct_all[i] = all((beta_hat != 0) == (beta != 0))
  }

  # Report simulation results for rho = 0.5
  mean(errors_all)
## [1] 1.107346
  mean(correct_all)
## [1] 0.31

The proportion of correct selections decreased substantially when the correlation increased from 0.1 to 0.5. Lasso can be unstable when predictors are highly correlated: a true variable and its correlated neighbors produce nearly the same fit, and hence nearly the same loss, so the Lasso may select a correlated variable in place of the true one and fail to recover the exact support.
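
As a rough illustration (not part of the graded answer), one can look at the marginal sample correlations between \(y\) and each column of \(X\) under \(\rho = 0.5\): the noise columns are themselves noticeably correlated with \(y\), so in a finite sample they can rival the true signal columns (1, 11, and 21), which is exactly when the Lasso may swap them. This sketch reuses p, beta, ntrain, and the \(\rho = 0.5\) Sigma defined above.

  # marginal correlations between y and each predictor under rho = 0.5
  x_demo = mvrnorm(ntrain, rep(0, p), Sigma)
  y_demo = x_demo %*% beta + rnorm(ntrain)
  marginal_cor = as.vector(cor(x_demo, y_demo))
  round(sort(marginal_cor, decreasing = TRUE)[1:6], 2)   # top correlations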

Question 3: Elastic Net

In HW3, we used the golub dataset from the multtest package. This dataset contains 3051 genes from 38 tumor mRNA samples from the leukemia microarray study of Golub et al. (1999). The outcome golub.cl is an indicator of two leukemia types: Acute Lymphoblastic Leukemia (ALL) or Acute Myeloid Leukemia (AML). In genetic analysis, many gene expressions are highly correlated, so we consider the Elastic Net model, which accommodates both sparsity and correlation.

Answer:

    # Load required packages
    library(glmnet)
    library(multtest)
    # Load Golub leukemia data
    data(golub)
    
    # Extract expression data (transposed for glmnet) and class labels
    X = t(as.matrix(golub))
    y = as.factor(golub.cl)
    
    # Initialize variables to keep track of results
    best_alpha = NULL
    best_lambda = NULL
    min_cv_error = Inf
    
    # List of alpha values to try
    alpha_values = seq(0, 1, by=0.1)
    
    # Loop over alpha values
    for (alpha in alpha_values) {
      
      # Fit elastic net model with the current alpha using cv.glmnet() and 19-fold CV
      # (n = 38 samples, so each fold contains two observations; fixing foldid across
      #  alpha values would make the comparison across alphas even more direct)
      cvfit = cv.glmnet(X, y, family="binomial", alpha=alpha, nfolds = 19)
      
      # Retrieve the mean cross-validated error for the optimal lambda
      cv_error = min(cvfit$cvm)
      
      # Update best_alpha and best_lambda if this model is better than previous best
      if (cv_error < min_cv_error) {
        best_alpha = alpha
        best_lambda = cvfit$lambda.min
        min_cv_error = cv_error
        best_model = cvfit
      }
    }
    
    # Output the best alpha and lambda
    cat("Best alpha:", best_alpha, "\n")
## Best alpha: 0.2
    cat("Best lambda:", best_lambda, "\n")
## Best lambda: 0.01957254
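
As a possible follow-up (not required by the prompt), the sparsity of the chosen model can be checked by counting the genes with nonzero coefficients at the selected \(\lambda\); best_model is the cv.glmnet fit stored for the best \(\alpha\) in the loop above.

    # number of genes kept by the selected elastic net model
    beta_hat = coef(best_model, s = best_lambda)   # sparse coefficients; first entry is the intercept
    sum(as.vector(beta_hat)[-1] != 0)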