Instructions

Please remove this section when submitting your homework.

Students are encouraged to work together on homework and/or utilize advanced AI tools. However, there are two basic rules:

Final submissions must be uploaded to Gradescope. No email or hard copy will be accepted. Please refer to the course website for late submission policy and grading rubrics.

Question 1 (Continuing the Simulation Study)

During our lecture, we considered a simulation study using the following data generator:

\[Y = \sum_{j = 1}^p X_j \cdot 0.4^{\sqrt{j}} + \epsilon\]

And we added covariates one by one (in their numerical order, which is also the order of their effect sizes) to observe how the training and testing errors change. However, in practice, we would not know the order of the variables. Hence, several model selection tools were introduced. In this question, we will use a similar but sparse data generator, i.e., one with only a few nonzero effects, and use different model selection tools to find the best model. The goal is to understand the performance of model selection tools under various scenarios. Let’s first consider the following data generator:

\[Y = \frac{1}{2} \cdot X_1 + \frac{1}{4} \cdot X_2 + \frac{1}{8} \cdot X_3 + \epsilon\]

where \(\epsilon \sim N(0, 1)\) and \(X_j \sim N(0, 1)\) for \(j = 1, \ldots, p\) are all independent. Write your code to complete the following tasks:

  1. [10 points] Generate one dataset, with sample size \(n = 100\) and dimension \(p = 8\), as in our lecture notes. Perform best subset selection (with the leaps package) and use the AIC criterion to select the best model.

    • Report the best model and its prediction error.
    • Does the approach select the correct model, meaning that all the nonzero-coefficient variables are selected and all the zero-coefficient variables are removed?
    • Which variable(s) were falsely selected and which variable(s) were falsely removed? Do not consider the intercept term, since it is always included in the model.
  2. [5 points] Repeat the previous step with 100 runs of simulation, similar to our lecture notes. Report

    1. the proportion of times that each variable was selected and present that using a plot.
    2. the proportion of times that this approach selects the correct model
  3. [5 points] Now, increase the number of variables to \(p = 30\). Repeat the previous step. Report

    1. The proportion of times that each variable was selected and present that using a plot.
    2. The proportion of times that this approach selects the correct model
    3. Compare this result with the previous one and discuss the impact of increasing \(p\). Do you expect the prediction error under this setting to be larger or smaller than the previous one? Why?
  4. [10 points] With the help of AI, you may find some interesting approaches that try to improve model selection performance. Work with AI models and try a new model selection approach. You cannot change the setting of the data generator (i.e., \(n = 100\), \(p = 30\), and the same data generator as above). Fully describe your approach and report the results of

    • The average proportion of times that each variable was selected and present that using a plot.
    • The proportion of times that this approach selects the correct model

Note that your approach does not necessarily need to be better than the previous one. However, try to explain the logic behind why it has the potential to improve. Also provide some discussion of the downsides of your approach.
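As a starting point, here is one possible sketch of the single-dataset run in part 1. The seed, object names, and the particular Gaussian AIC formula (written up to an additive constant) are our own choices, not requirements of the assignment:

```r
# Sketch: one run of part 1 (assumes the leaps package is installed)
library(leaps)

set.seed(1)                      # arbitrary seed
n <- 100; p <- 8
X <- matrix(rnorm(n * p), n, p)
beta <- c(1/2, 1/4, 1/8, rep(0, p - 3))
y <- as.vector(X %*% beta + rnorm(n))

# best subset selection: one best model of each size 1..p
fit <- regsubsets(x = X, y = y, nvmax = p)
ss  <- summary(fit)

# Gaussian AIC, up to an additive constant:
# n * log(RSS / n) + 2 * (number of estimated coefficients)
model_size <- 1:p
aic  <- n * log(ss$rss / n) + 2 * (model_size + 1)  # +1 for the intercept
best <- which.min(aic)
ss$which[best, ]                 # TRUE = variable selected in the best model
```

For parts 2 and 3, this chunk can be wrapped in a loop that records `ss$which[best, ]` across 100 replications.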

Question 2 (Training and Testing of Linear Regression)

We have introduced the formula of a linear regression

\[\widehat{\boldsymbol \beta} = (\mathbf{X}^\text{T} \mathbf{X})^{-1}\mathbf{X}^\text{T} \mathbf{y}\]

Let’s use the realestate data as an example. The data can be obtained from our course website. Here, \(\mathbf{X}\) is the design matrix with 414 observations and 4 columns: a column of 1s for the intercept, plus age, distance and stores. \(\mathbf{y}\) is the outcome vector of price.
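The formula above translates directly into R. A minimal sketch, assuming `realestate` is already loaded and has columns named `age`, `distance`, `stores`, and `price` (check the names in your copy of the data):

```r
# Sketch: beta-hat = (X'X)^{-1} X'y via the normal equations
X <- cbind(1, realestate$age, realestate$distance, realestate$stores)
y <- realestate$price
beta_hat <- solve(t(X) %*% X, t(X) %*% y)  # solve(A, b) avoids forming the inverse explicitly
beta_hat
```

Using `solve(A, b)` rather than `solve(A) %*% b` is a standard numerical choice; both give the same answer here.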

  1. [10 points] Write R code to properly define both \(\mathbf{X}\) and \(\mathbf{y}\), and then perform the linear regression using the above formula. You cannot use lm() for this step. Report your \(\widehat{\boldsymbol \beta}\). After getting your answer, compare it with the fitted coefficients from the lm() function.

  2. [10 points] Split your data into two parts: a testing data that contains 100 observations, and the rest as training data. Use the following code to generate the ids for the testing data. Use your previous code to fit a linear regression model (predict price with age, distance and stores), and then calculate the prediction error on the testing data. Report your (mean) training error and testing (prediction) error:

\[\begin{align} \text{Training Error} =& \frac{1}{n_\text{train}} \sum_{i \in \text{Train}} (y_i - \hat y_i)^2 \\ \text{Testing Error} =& \frac{1}{n_\text{test}} \sum_{i \in \text{Test}} (y_i - \hat y_i)^2 \end{align}\]

Here \(y_i\) is the original \(y\) value and \(\hat y_i\) is the fitted (for training data) or predicted (for testing data) value. Which one do you expect to be larger, and why? After carrying out your analysis, does the result match your expectation? If not, what could be the causes?

  # generate the indices for the testing data
  set.seed(432)
  test_idx = sample(nrow(realestate), 100)

  3. [10 points] Alternatively, you can always use built-in functions to fit a linear regression. Set up your code to perform a stepwise linear regression using the step() function (using all covariates). Choose one among the AIC/BIC/Cp criteria to select the best model. For the step() function, you can use any configuration you like, such as the direction. You should still use the same training and testing ids defined previously. Report your best model, training error and testing error.
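Putting the pieces of this question together, the two errors can be computed along the following lines; this is only a sketch, and again assumes the `realestate` column names used above:

```r
# Sketch: training and testing errors with the split from part 2
X <- cbind(1, realestate$age, realestate$distance, realestate$stores)
y <- realestate$price

set.seed(432)
test_idx <- sample(nrow(realestate), 100)

# fit on the training rows only, using the normal equations
beta_hat  <- solve(t(X[-test_idx, ]) %*% X[-test_idx, ],
                   t(X[-test_idx, ]) %*% y[-test_idx])

train_err <- mean((y[-test_idx] - X[-test_idx, ] %*% beta_hat)^2)
test_err  <- mean((y[test_idx]  - X[test_idx, ]  %*% beta_hat)^2)
c(train = train_err, test = test_err)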

Question 3 (Optimization)

  1. [5 Points] Consider minimizing the following univariate function:

\[f(x) = \exp(1.5 \times x) - 3 \times (x + 6)^2 - 0.05 \times x^3\]

Write a function f_obj(x) that calculates this objective function. Plot this function on the domain \(x \in [-40, 7]\).

  2. [10 Points] Use the optim() function to solve this optimization problem. Use method = "BFGS". Try two initial points: -15 and 0. Report the solutions you obtained. Are they different? Why?

  3. [10 Points] Consider minimizing the following bivariate function:

\[f(x,y) = 3x^2 + 2y^2 - 4xy + 6x - 5y + 7\]

Derive the partial derivatives of this function with respect to \(x\) and \(y\), and solve for the analytic solution by applying the first-order conditions.

  4. [10 Points] Check the second-order condition to verify that the solution you obtained in the previous step is indeed a minimum.

  5. [5 Points] Use the optim() function to solve this optimization problem. Use method = "BFGS". Set your own initial point. Report the solutions you obtained. Do different choices of the initial point lead to different solutions? Why?
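The calls below sketch how optim() can be used for both parts of this question; note that for the bivariate case, optim() minimizes over a single parameter vector. The starting point `c(0, 0)` is an arbitrary choice for illustration:

```r
# Sketch: minimization with optim() and method = "BFGS"
f_obj <- function(x) exp(1.5 * x) - 3 * (x + 6)^2 - 0.05 * x^3

optim(par = -15, fn = f_obj, method = "BFGS")$par  # first suggested start
optim(par = 0,   fn = f_obj, method = "BFGS")$par  # second suggested start

# the bivariate objective takes a length-2 parameter vector v = (x, y)
f_xy <- function(v) {
  x <- v[1]; y <- v[2]
  3 * x^2 + 2 * y^2 - 4 * x * y + 6 * x - 5 * y + 7
}
optim(par = c(0, 0), fn = f_xy, method = "BFGS")$par
```

Comparing the numerical solution of `f_xy` with your analytic answer from the first-order conditions is a useful sanity check.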