Please remove this section when submitting your homework.
Students are encouraged to work together on homework and/or utilize advanced AI tools. However, sharing, copying, or providing any part of a homework solution or code to others is an infraction of the University’s rules on Academic Integrity. Any violation will be punished as severely as possible. Final submissions must be uploaded to Gradescope. No email or hard copy will be accepted. For late submission policy and grading rubrics, please refer to the course website.
Your file should be named as `HWx_yourNetID.pdf`. For example, `HW01_rqzhu.pdf`. Please note that this must be a `.pdf` file; the `.html` format cannot be accepted. Make all of your R code chunks visible for grading. If you use the `.Rmd` file as a template, be sure to remove this instruction section. Make sure that your R version is \(\geq 4.0.0\). This will ensure that your random seed generation is the same as everyone else's. Please note that updating the R version may require you to reinstall all of your packages.

During our lecture, we considered a simulation study using the following data generator:
\[Y = \sum_{j = 1}^p X_j 0.4^\sqrt{j} + \epsilon\]
And we added covariates one by one (in their numerical order, which is also the order of their effect sizes) to observe the change in training error and testing error. However, in practice, we would not know the order of the variables. Hence, several model selection tools were introduced. In this question, we will use similar data generators, with several nonzero effects, but use different model selection tools to find the best model. The goal is to understand the performance of model selection tools under various scenarios. Let's first consider the following data generator:
\[Y = \frac{1}{2} \cdot X_1 + \frac{1}{4} \cdot X_2 + \frac{1}{8} \cdot X_3 + \frac{1}{16} \cdot X_4 + \epsilon\]
where \(\epsilon \sim N(0, 1)\) and \(X_j \sim N(0, 1)\) for \(j = 1, \ldots, p\). Write your code to complete the following tasks:
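For concreteness, here is a minimal sketch of this data generator (the seed and the object names are my own choices; \(n\) and \(p\) follow the first task below):

```r
# a minimal sketch of the data generator above; seed and names are my own choices
set.seed(432)
n = 100
p = 20
X = matrix(rnorm(n * p), n, p)
colnames(X) = paste0("X", 1:p)
beta = c(1/2, 1/4, 1/8, 1/16, rep(0, p - 4))
y = drop(X %*% beta + rnorm(n))
```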
[10 points] Generate one dataset, with sample size \(n = 100\) and dimension \(p = 20\), as in our lecture note. Perform best subset selection (with the `leaps` package) and use the AIC criterion to select the best model. Report the best model and its prediction error. Does the approach select the correct model, meaning that all variables with nonzero coefficients are selected and all variables with zero coefficients are removed? Which variable(s) were falsely selected and which variable(s) were falsely removed? Do not consider the intercept term, since it is always included in the model. Why do you think this happens?
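As a starting point, here is a minimal sketch of this step (it reuses `n`, `p`, `X`, and `y` from the generator sketch above; the AIC below drops additive constants, which does not affect the ranking of models):

```r
library(leaps)

# best subset selection: regsubsets() keeps the best model of each size
fit = regsubsets(x = X, y = y, nvmax = p)
sumfit = summary(fit)

# Gaussian AIC up to an additive constant: n * log(RSS / n) + 2 * df,
# where df = (number of predictors) + 1 for the intercept
aic = n * log(sumfit$rss / n) + 2 * ((1:p) + 1)
best_size = which.min(aic)

# variables in the AIC-best model (first column of `which` is the intercept)
names(which(sumfit$which[best_size, -1]))
```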
[10 points] Repeat the previous step with 100 runs of simulation, similar to our lecture note. Report the proportion of times that this approach selects the correct model.
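A rough skeleton for the repeated simulation, again reusing objects from the sketches above (a sketch, not a definitive implementation):

```r
# repeat the generate-then-select procedure and track exact recovery
nsim = 100
correct = logical(nsim)
for (i in 1:nsim) {
  X = matrix(rnorm(n * p), n, p)
  colnames(X) = paste0("X", 1:p)
  y = drop(X %*% beta + rnorm(n))
  sumfit = summary(regsubsets(x = X, y = y, nvmax = p))
  aic = n * log(sumfit$rss / n) + 2 * ((1:p) + 1)
  sel = which(sumfit$which[which.min(aic), -1])
  correct[i] = setequal(sel, 1:4)   # the true model is X1 through X4
}
mean(correct)   # proportion of runs selecting exactly the true model
```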
[10 points] In the previous question, you should be able to observe that the proportion of times this approach selects the correct model is relatively low. This could be due to many reasons. Can you suggest some situations (settings of the model) or approaches (changes to your model fitting procedure) for which this chance would be much improved (consider using AI tools if needed)? Implement that idea, verify the new selection rate, and compare it with the previous result.
We have introduced the formula for the linear regression coefficients:
\[\widehat{\boldsymbol \beta} = (\mathbf{X}^\text{T} \mathbf{X})^{-1}\mathbf{X}^\text{T} \mathbf{y}\]
Let's use the `realestate` data as an example. The data can be obtained from our course website. Here, \(\mathbf{X}\) is the design matrix with 414 observations and 4 columns: a column of 1's as the intercept, and `age`, `distance`, and `stores`. \(\mathbf{y}\) is the outcome vector of `price`.
Write R code to properly define both \(\mathbf{X}\) and \(\mathbf{y}\), and then perform the linear regression using the above formula. You cannot use `lm()` for this step. Report your \(\hat \beta\). After getting your answer, compare it with the fitted coefficients from the `lm()` function.

```r
# load the data
realestate = read.csv("realestate.csv", row.names = 1)
```
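A sketch of the closed-form computation (assuming the columns are named `age`, `distance`, `stores`, and `price` as described above; adjust if your file differs):

```r
# design matrix: a column of 1's for the intercept plus the three covariates
X = cbind(intercept = 1,
          as.matrix(realestate[, c("age", "distance", "stores")]))
y = realestate$price

# solve the normal equations (X^T X) beta = X^T y
beta_hat = solve(crossprod(X), crossprod(X, y))
beta_hat

# sanity check against lm()
coef(lm(price ~ age + distance + stores, data = realestate))
```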
Fit a linear regression on the training data (modeling `price` with `age`, `distance`, and `stores`), and then calculate the prediction error on the testing data. Report your (mean) training error and testing (prediction) error:

\[\begin{align}
\text{Training Error} &= \frac{1}{n_\text{train}} \sum_{i \in \text{Train}} (y_i - \hat y_i)^2 \\
\text{Testing Error} &= \frac{1}{n_\text{test}} \sum_{i \in \text{Test}} (y_i - \hat y_i)^2
\end{align}\]

Here \(y_i\) is the original \(y\) value and \(\hat y_i\) is the fitted (for training data) or predicted (for testing data) value. Which one do you expect to be larger, and why? After carrying out your analysis, does the result match your expectation? If not, what could be the causes?

```r
# generate the indices for the testing data
set.seed(432)
test_idx = sample(nrow(realestate), 100)
```
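A minimal sketch of the fit-and-evaluate step under this split (column names as assumed earlier):

```r
# split into training and testing parts using the indices above
train = realestate[-test_idx, ]
test  = realestate[test_idx, ]

# fit on the training data only
fit = lm(price ~ age + distance + stores, data = train)

# mean squared errors
mean((train$price - fitted(fit))^2)          # training error
mean((test$price - predict(fit, test))^2)    # testing error
```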
Perform model selection with the `step()` function (using all covariates). Choose one among the AIC/BIC/Cp criteria to select the best model. For the `step()` function, you can use any configuration you like, such as `direction`, etc. You should still use the same training and testing ids defined previously. Report your best model, training error, and testing error.
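A sketch of one possible `step()` configuration (AIC, searching both directions; `train` and `test` are the objects from the previous part):

```r
# start from the full model with all covariates and search both directions;
# k = 2 corresponds to the AIC penalty (k = log(n) would give BIC)
full = lm(price ~ ., data = train)
best = step(full, direction = "both", k = 2, trace = 0)

summary(best)
mean((train$price - fitted(best))^2)         # training error
mean((test$price - predict(best, test))^2)   # testing error
```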
Consider the following objective function:

\[f(x) = \exp(1.5 \times x) - 3 \times (x + 6)^2 - 0.05 \times x^3\]
Write a function `f_obj(x)` that calculates this objective function. Plot this function on the domain \(x \in [-40, 7]\).
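A minimal sketch of the objective and its plot:

```r
# the objective function as given above
f_obj = function(x) exp(1.5 * x) - 3 * (x + 6)^2 - 0.05 * x^3

# plot over the requested domain
curve(f_obj, from = -40, to = 7, n = 1000, xlab = "x", ylab = "f(x)")
```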
[10 Points] Use the `optim()` function to solve this optimization problem. Use `method = "BFGS"`. Try two initial points: -15 and 0. Report the solutions you obtained. Are they different? Why?
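A sketch of the two runs (`optim()` minimizes by default):

```r
# same method, two different starting values
sol1 = optim(par = -15, fn = f_obj, method = "BFGS")
sol2 = optim(par = 0,   fn = f_obj, method = "BFGS")
c(sol1$par, sol2$par)   # compare the two solutions
```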
[10 Points] Consider minimizing the bivariate function

\[f(x,y) = 3x^2 + 2y^2 - 4xy + 6x - 5y + 7\]
Derive the partial derivatives of this function with respect to \(x\) and \(y\), and solve for the analytic solution by applying the first-order conditions.
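For reference, a sketch of the algebra (my own derivation; verify it against your own work):

\[\begin{align}
\frac{\partial f}{\partial x} &= 6x - 4y + 6 = 0 \\
\frac{\partial f}{\partial y} &= 4y - 4x - 5 = 0
\end{align}\]

Adding the two equations gives \(2x + 1 = 0\), so \(x = -1/2\), and substituting back into the second equation gives \(y = x + 5/4 = 3/4\).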
[10 Points] Check the second-order condition to verify that the solution you obtained in the previous step is indeed a minimum.
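A sketch of the second-order check: the Hessian here is constant,

\[\mathbf{H} = \begin{pmatrix} 6 & -4 \\ -4 & 4 \end{pmatrix},\]

and since \(6 > 0\) and \(\det(\mathbf{H}) = 24 - 16 = 8 > 0\), the Hessian is positive definite, so the stationary point is the global minimum of this convex quadratic.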
[5 Points] Use the `optim()` function to solve this optimization problem. Use `method = "BFGS"`. Set your own initial point. Report the solutions you obtained. Do different choices of the initial point lead to different solutions? Why?
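A sketch for the bivariate case; `optim()` expects the parameters as a single vector:

```r
# objective written over a parameter vector par = c(x, y)
f_xy = function(par) {
  x = par[1]
  y = par[2]
  3 * x^2 + 2 * y^2 - 4 * x * y + 6 * x - 5 * y + 7
}

# an arbitrary starting point; the problem is convex, so any start should work
optim(par = c(0, 0), fn = f_xy, method = "BFGS")$par
```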