Please remove this section when submitting your homework.
Students are encouraged to work together on homework and/or utilize advanced AI tools. However, there are two basic rules:
Final submissions must be uploaded to Gradescope. No email or hard copy will be accepted. Please refer to the course website for late submission policy and grading rubrics.
HWx_yourNetID.pdf
. For example,
HW01_rqzhu.pdf
. Please note that this must be a
.pdf
file. .html
format will not be
accepted because they are often not readable on gradescope.
Make all of your R
code chunks visible for
grading.R
is \(\geq 4.0.0\). This
will ensure your random seed generation is the same as everyone
else..Rmd
file
as a template, be sure to remove this instruction
section.During our lecture, we considered a simulation study using the following data generator:
\[Y = \sum_{j = 1}^p X_j 0.4^\sqrt{j} + \epsilon\]
And we added covariates one by one (in their numerical order, which is also the size of their effect) to observe the change of training error and testing error. However, in practice, we would not know the order of the variables. Hence several model selection tools were introduced. In this question, we will use a similar data generator, but sparse, i.e., with several nonzero effects, but use different model selection tools to find the best model. The goal is to understand the performance of model selection tools under various scenarios. Let’s first consider the following data generator:
\[Y = \frac{1}{2} \cdot X_1 + \frac{1}{4} \cdot X_2 + \frac{1}{8} \cdot X_3 + \epsilon\]
where \(\epsilon \sim N(0, 1)\) and \(X_j \sim N(0, 1)\) for \(j = 1, \ldots, p\) are all independent. Write your code the complete the following tasks:
[10 points] Generate one dataset, with sample size \(n = 100\) and dimension \(p = 8\) as our lecture note. Perform best
subset selection (with the leaps
package) and use the AIC
criterion to select the best model.
[5 points] Repeat the previous step with 100 runs of simulation, similar to our lecture note. Report
[5 points] Now, increase the number of variables to \(p = 30\). Repeat the previous step. Report
[10 points] With the help of AI, there could be some interesting approaches trying to improve the model selection performance. Work with AI models and try a new model selection approach. You cannot change the setting of the data generator (i.e., \(n = 100\), \(p = 30\) and the data generator should be the same as above). Fully describe your approach and report the results of
Note that your approach does not necessarily need to be better than the previous one. However, try to explain the logic why it has the potential. But also provide some discussion on the downside of your approach.
We have introduced the formula of a linear regression
\[\widehat{\boldsymbol \beta} = (\mathbf{X}^\text{T} \mathbf{X})^{-1}\mathbf{X}^\text{T} \mathbf{y}\]
Let’s use the realestate
data as an example. The data
can be obtained from our course website. Here, \(\mathbf{X}\) is the design matrix with 414
observations and 4 columns: a column of 1 as the intercept, and
age
, distance
and stores
. \(\mathbf{y}\) is the outcome vector of
price
.
[10 points] Write an R
code to properly define both
\(\mathbf{X}\) and \(\mathbf{y}\), and then perform the linear
regression using the above formula. You cannot use lm()
for
this step. Report your \(\hat \beta\).
After getting your answer, compare that with the fitted coefficients
from the lm()
function.
[10 points] Split your data into two parts: a testing data that
contains 100 observations, and the rest as training data. Use the
following code to generate the ids for the testing data. Use your
previous code to fit a linear regression model (predict
price
with age
, distance
and
stores
), and then calculate the prediction error on the
testing data. Report your (mean) training error and testing (prediction)
error:
\[\begin{align} \text{Training Error} =& \frac{1}{n_\text{train}} \sum_{i \in \text{Train}} (y_i - \hat y_i)^2 \\ \text{Testing Error} =& \frac{1}{n_\text{test}} \sum_{i \in \text{Test}} (y_i - \hat y_i)^2 \end{align}\]
Here \(y_i\) is the original \(y\) value and \(\hat y_i\) is the fitted (for training data) or predicted (for testing data) value. Which one do you expect to be larger, and why? After carrying out your analysis, does the result matches your expectation? If not, what could be the causes?
# generate the indices for the testing data
set.seed(432)
test_idx = sample(nrow(realestate), 100)
step()
function (using all
covariates). Choose one among the AIC/BIC/Cp criterion to select the
best model. For the step()
function, you can use any
configuration you like, such as direction
etc. You should
still use the same training and testing ids defined previously. Report
your best model, training error and testing error.\[f(x) = \exp(1.5 \times x) - 3 \times (x + 6)^2 - 0.05 \times x^3\]
Write a function f_obj(x)
that calculates this objective
function. Plot this function on the domain \(x
\in [-40, 7]\).
[10 Points] Use the optim()
function to solve this
optimization problem. Use method = "BFGS"
. Try two initial
points: -15 and 0. Report Are the solutions you obtained different?
Why?
[10 Points] Consider a bi-variate function to minimize
\[ f(x,y) = 3x^2 + 2y^2 − 4xy + 6x − 5y + 7\]
Derive the partial derivatives of this function with respect to \(x\) and \(y\). And solve for the analytic solution of this function by applying the first-order conditions.
[10 Points] Check the second-order condition to verify that the solution you obtained in the previous step is indeed a minimum.
[5 Points] Use the optim()
function to solve this
optimization problem. Use method = "BFGS"
. Set your own
initial point. Report the solutions you obtained. Does different choices
of the initial point lead to different solutions? Why?