Instruction

Please remove this section when submitting your homework.

Students are encouraged to work together on homework and/or utilize advanced AI tools. However, sharing, copying, or providing any part of a homework solution or code to others is an infraction of the University’s rules on Academic Integrity. Any violation will be punished as severely as possible. Final submissions must be uploaded to Gradescope. No email or hard copy will be accepted. For late submission policy and grading rubrics, please refer to the course website.

Question 1 (Continuing the Simulation Study)

During our lecture, we considered a simulation study using the following data generator:

\[Y = \sum_{j = 1}^p X_j 0.4^\sqrt{j} + \epsilon\]

And we added covariates one by one (in their numerical order, which is also the size of their effect) to observe the change of training error and testing error. However, in practice, we would not know the order of the variables. Hence several model selection tools were introduced. In this question, we will use similar data generators, with several nonzero effects, but use different model selection tools to find the best model. The goal is to understand the performance of model selection tools under various scenarios. Let’s first consider the following data generator:

\[Y = \frac{1}{2} \cdot X_1 + \frac{1}{4} \cdot X_2 + \frac{1}{8} \cdot X_3 + \frac{1}{16} \cdot X_4 + \epsilon\]

where \(\epsilon \sim N(0, 1)\) and \(X_j \sim N(0, 1)\) for \(j = 1, \ldots, p\). Write your code the complete the following tasks:

  1. [10 points] Generate one dataset, with sample size \(n = 100\) and dimension \(p = 20\) as our lecture note. Perform best subset selection (with the leaps package) and use the AIC criterion to select the best model. Report the best model and its prediction error. Does the approach selects the correct model, meaning that all the nonzero coefficient variables are selected and all the zero coefficient variables are removed? Which variable(s) was falsely selected and which variable(s) was falsely removed? Do not consider the intercept term, since they are always included in the model. Why do you think this happens?

  2. [10 points] Repeat the previous step with 100 runs of simulation, similar to our lecture note. Report

    1. the proportion of times that this approach selects the correct model
    2. the proportion of times that each variable was selected
  3. [10 points] In the previous question, you should be able to observe that the proportion of times that this approach selects the correct model is relatively low. This could be due to many reasons. Can you suggest some situations (setting of the model) or approaches (your model fitting procedure) for which the chance will be much improved (consider using AI tools if needed)? Implement that idea and verify the new selection rate and compare with the previous result. Furthermore,

    1. Discuss each of the settings or appraoches you have altered and explain why it can improve the selection rate.
    2. If you use AI tools, discuss your experience with it. Such as how to write the prompt and whether you had to further modeify the code.

Question 2 (Training and Testing of Linear Regression)

We have introduced the formula of a linear regression

\[\widehat{\boldsymbol \beta} = (\mathbf{X}^\text{T} \mathbf{X})^{-1}\mathbf{X}^\text{T} \mathbf{y}\]

Let’s use the realestate data as an example. The data can be obtained from our course website. Here, \(\mathbf{X}\) is the design matrix with 414 observations and 4 columns: a column of 1 as the intercept, and age, distance and stores. \(\mathbf{y}\) is the outcome vector of price.

  1. [10 points] Write an R code to properly define both \(\mathbf{X}\) and \(\mathbf{y}\), and then perform the linear regression using the above formula. You cannot use lm() for this step. Report your \(\hat \beta\). After getting your answer, compare that with the fitted coefficients from the lm() function.
    # load the data
    realestate = read.csv("realestate.csv", row.names = 1)
  1. [10 points] Split your data into two parts: a testing data that contains 100 observations, and the rest as training data. Use the following code to generate the ids for the testing data. Use your previous code to fit a linear regression model (predict price with age, distance and stores), and then calculate the prediction error on the testing data. Report your (mean) training error and testing (prediction) error: \[\begin{align} \text{Training Error} =& \frac{1}{n_\text{train}} \sum_{i \in \text{Train}} (y_i - \hat y_i)^2 \\ \text{Testing Error} =& \frac{1}{n_\text{test}} \sum_{i \in \text{Test}} (y_i - \hat y_i)^2 \end{align}\] Here \(y_i\) is the original \(y\) value and \(\hat y_i\) is the fitted (for training data) or predicted (for testing data) value. Which one do you expect to be larger, and why? After carrying out your analysis, does the result matches your expectation? If not, what could be the causes?
  # generate the indices for the testing data
  set.seed(432)
  test_idx = sample(nrow(realestate), 100)
  1. [10 points] Alternatively, you can always use built-in functions to fit linear regression. Setup your code to perform a step-wise linear regression using the step() function (using all covariates). Choose one among the AIC/BIC/Cp criterion to select the best model. For the step() function, you can use any configuration you like, such as direction etc. You should still use the same training and testing ids defined previously. Report your best model, training error and testing error.

Question 3 (Optimization)

  1. [5 Points] Consider minimizing the following univariate function:

\[f(x) = \exp(1.5 \times x) - 3 \times (x + 6)^2 - 0.05 \times x^3\]

Write a function f_obj(x) that calculates this objective function. Plot this function on the domain \(x \in [-40, 7]\).

  1. [10 Points] Use the optim() function to solve this optimization problem. Use method = "BFGS". Try two initial points: -15 and 0. Report Are the solutions you obtained different? Why?

  2. [10 Points] Consider a bi-variate function to minimize

\[ f(x,y) = 3x^2 + 2y^2 − 4xy + 6x − 5y + 7\]

Derive the partial derivatives of this function with respect to \(x\) and \(y\). And solve for the analytic solution of this function by applying the first-order conditions.

  1. [10 Points] Check the second-order condition to verify that the solution you obtained in the previous step is indeed a minimum.

  2. [5 Points] Use the optim() function to solve this optimization problem. Use method = "BFGS". Set your own initial point. Report the solutions you obtained. Does different choices of the initial point lead to different solutions? Why?