Instruction

Please remove this section when submitting your homework.

Students are encouraged to work together on homework and/or utilize advanced AI tools. However, sharing, copying, or providing any part of a homework solution or code to others is an infraction of the University’s rules on Academic Integrity. Any violation will be punished as severely as possible. Final submissions must be uploaded to Gradescope. No email or hard copy will be accepted. For late submission policy and grading rubrics, please refer to the course website.

Question 1: Lasso Simulation

During our lecture, we considered a simulation model to analyze the variable selection property of Lasso. Now let’s further investigate the prediction error caused by the \(L_1\) penalty under this model, and understand the bias-variance trade-off. For this question, your underlying true data generating model should be

\[\begin{align} Y &= X^\text{T} \boldsymbol \beta + \epsilon \\ &= \sum_{j = 1}^p X_j \, 0.4^{\sqrt{j}} + \epsilon, \end{align}\]

where \(p = 30\), each \(X_j\) is generated independently from \(\mathcal{N}(0, 1)\), and \(\epsilon\) also follows a standard normal distribution, independent of \(X\). The goal is to predict two target points and investigate how the prediction error changes under different penalty levels. The two target testing points are defined by the following code.

    # target testing points
    p = 30
    xa = xb = rep(0, p)
    xa[2] = 1
    xb[10] = 1
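
The snippet above defines only the two target points. A minimal sketch of generating the training data under the stated model might look like the following (the sample size `n = 200` is an illustrative assumption, not specified in the text; use the value given in your course materials):

    # generate training data under the true model
    # n = 200 is an assumed sample size for illustration
    set.seed(1)
    n = 200
    p = 30
    X = matrix(rnorm(n * p), n, p)    # each X_j ~ N(0, 1), independent
    beta = 0.4^sqrt(1:p)              # true coefficients 0.4^sqrt(j)
    y = X %*% beta + rnorm(n)         # epsilon ~ N(0, 1)
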

Answer the following questions:

Question 2: Lasso with Correlated Variables

A key challenge in using Lasso regression is its potential difficulty in accurately selecting variables when predictors are highly correlated. This assignment is a simulation study designed to investigate the impact of variable correlation on the performance of the Lasso algorithm. Consider the linear model defined as:

\[ Y = X^\text{T} \boldsymbol \beta + \epsilon \]

where \(\boldsymbol \beta = (\beta_1, \beta_2, \ldots, \beta_{30})^\text{T}\) with \(\beta_1 = \beta_{11} = \beta_{21} = 0.4\) and all other \(\beta\) parameters set to zero. The \(p\)-dimensional covariate \(X\) follows a multivariate Gaussian distribution:

\[ \mathbf{X} \sim {\cal N}(\mathbf{0}, \Sigma_{p\times p}). \]

In \(\Sigma\), all diagonal elements are 1, and all off-diagonal elements are \(\rho\).
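
One way to construct this compound-symmetry covariance matrix and draw samples from it is via `mvrnorm()` from the `MASS` package (a sketch; any multivariate normal sampler works equally well):

    # compound-symmetry covariance: 1 on the diagonal, rho off-diagonal
    library(MASS)    # for mvrnorm()
    p = 30
    rho = 0.1
    Sigma = matrix(rho, p, p)
    diag(Sigma) = 1
    X = mvrnorm(n = 300, mu = rep(0, p), Sigma = Sigma)
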

a) Single Simulation Run [15 pts]

  • Generate 300 training and 100 testing samples independently based on the above model.
  • Use \(\rho = 0.1\).
  • Fit a Lasso model using cv.glmnet() on the training data with 10-fold cross-validation. Use lambda.1se to select the optimal \(\lambda\).
  • Report:
    • Prediction error (MSE) on the test data using lambda.1se.
    • Whether the correct model was selected (i.e., whether the nonzero variables are correctly identified and zero variables are correctly excluded). You may refer to HW3 for this property.
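
A single run following the steps above can be sketched as follows (seeds and object names are illustrative choices, not prescribed by the assignment):

    # one simulation run: generate data, fit Lasso, evaluate
    library(glmnet)
    library(MASS)
    set.seed(1)

    p = 30; rho = 0.1
    beta = rep(0, p); beta[c(1, 11, 21)] = 0.4
    Sigma = matrix(rho, p, p); diag(Sigma) = 1

    xtrain = mvrnorm(300, rep(0, p), Sigma)
    ytrain = xtrain %*% beta + rnorm(300)
    xtest  = mvrnorm(100, rep(0, p), Sigma)
    ytest  = xtest %*% beta + rnorm(100)

    fit  = cv.glmnet(xtrain, ytrain, nfolds = 10)
    pred = predict(fit, newx = xtest, s = "lambda.1se")
    mse  = mean((ytest - pred)^2)    # test prediction error

    # correct selection: nonzero coefficients (excluding the intercept)
    # must be exactly variables 1, 11, and 21
    selected = which(coef(fit, s = "lambda.1se")[-1] != 0)
    correct  = setequal(selected, c(1, 11, 21))
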

b) Multiple Runs [15 pts]

  • Perform 100 simulation runs as in part a).
  • For each run, record the prediction error and the oracle status of the selected variables.
  • Report the average prediction error and the proportion of runs in which the correct model was selected.
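
The replication can be organized as a loop over the single-run logic from part a). In the sketch below, `one_run(rho)` is a hypothetical helper you would write yourself from part a), returning a named vector with the test MSE and a 0/1 indicator of correct selection:

    # repeat the single-run experiment nsim times and summarize
    # one_run() is a hypothetical wrapper around the part a) code
    nsim = 100
    results = t(sapply(1:nsim, function(i) one_run(rho = 0.1)))
    mean(results[, "mse"])       # average prediction error
    mean(results[, "correct"])   # proportion of correct selections
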

c) Impact of Increased Correlation [10 pts]

  • Repeat task b) with \(\rho = 0.5\).
  • Report the average prediction error and the proportion of oracle estimations.
  • Discuss the reasons behind any observed changes in the proportion of oracle estimations when \(\rho\) changes from 0.1 to 0.5.

Question 3: Elastic Net

In HW3, we used the golub dataset from the multtest package. This dataset contains expression values of 3051 genes from 38 tumor mRNA samples in the leukemia microarray study of Golub et al. (1999). The outcome golub.cl is an indicator of two leukemia types: Acute Lymphoblastic Leukemia (ALL) or Acute Myeloid Leukemia (AML). In genetic analysis, many gene expressions are highly correlated, so we can consider the Elastic Net model, which encourages sparsity while accommodating correlated predictors.
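
In glmnet, the Elastic Net corresponds to a mixing parameter `alpha` strictly between 0 (ridge) and 1 (lasso). A minimal sketch of fitting it to these data follows; `alpha = 0.5` is an illustrative value, not one prescribed by the assignment:

    # Elastic Net on the golub data via glmnet
    library(glmnet)
    library(multtest)
    data(golub)    # loads golub (genes x samples) and golub.cl

    x = t(golub)   # transpose to samples x genes (38 x 3051)
    y = golub.cl   # binary leukemia-type indicator
    # alpha = 0.5 mixes the L1 and L2 penalties equally (illustrative choice)
    fit = cv.glmnet(x, y, family = "binomial", alpha = 0.5, nfolds = 10)
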