Please remove this section when submitting your homework.
Students are encouraged to work together on homework and/or utilize advanced AI tools. However, sharing, copying, or providing any part of a homework solution or code to others is an infraction of the University’s rules on Academic Integrity. Any violation will be punished as severely as possible. Final submissions must be uploaded to Gradescope. No email or hard copy will be accepted. For late submission policy and grading rubrics, please refer to the course website.
HWx_yourNetID.pdf
. For example,
HW01_rqzhu.pdf
. Please note that this must be a
.pdf
file. .html
format
cannot be accepted. Make all of your R
code chunks visible for grading..Rmd
file
as a template, be sure to remove this instruction
section.R
is \(\geq
4.0.0\). This will ensure your random seed generation is the same
as everyone else. Please note that updating the R
version
may require you to reinstall all of your packages.During our lecture, we considered a simulation study using the following data generator:
\[Y = \sum_{j = 1}^p X_j 0.4^\sqrt{j} + \epsilon\]
And we added covariates one by one (in their numerical order, which is also the size of their effect) to observe the change of training error and testing error. However, in practice, we would not know the order of the variables. Hence several model selection tools were introduced. In this question, we will use similar data generators, with several nonzero effects, but use different model selection tools to find the best model. The goal is to understand the performence of model selection tools under various scenarios. Let’s first consider the following data generator:
\[Y = 0.5 + X_1 + 0.5 \cdot X_2 + 0.25 \cdot X_3 + 0.125 \cdot X_4 + \epsilon\]
where \(\epsilon \sim N(0, 1)\) and \(X_j \sim N(0, 1)\) for \(j = 1, \ldots, p\). Write your code the complete the following tasks:
leaps
package) and use the AIC
criterion to select the best model. Report the best model and its
prediction error. Does the appraoch selects the correct
model, meaning that all the nonzero coefficient variables are
selected and all the zero coefficient variables are removed? Which
variable(s) was falsely selected and which variable(s) was falsely
removed? Do not consider the intercept term, since they
are always included in the model.Answer:
# Load the required library
library(leaps)
# set seed
set.seed(1)
# generate data using the given data generator
n = 100
p = 20
x = matrix(rnorm(n*p), n, p)
colnames(x) <- paste0("X", 1:p)
y = 0.5 + x[, 1] + 0.5*x[, 2] + 0.25*x[, 3] + 0.125*x[, 4] + rnorm(n)
# Best subset selection
best_subsets <- regsubsets(x, y, nvmax = p)
sumleaps <- summary(best_subsets, matrix = T)
# Extract AIC for each model size
modelsize = apply(sumleaps$which,1,sum)
lm.fit <- lm(y ~ x)
AIC = n*log(sumleaps$rss/n) + 2*modelsize
# Plot AIC
plot(modelsize, AIC, xlab = "Model Size", ylab = "AIC", type = "b", col = "darkorange", lwd = 3)
# check if the model is correct
all(sumleaps$which[which.min(AIC), -1] == c(1, 1, 1, 1, rep(0, 16)))
## [1] FALSE
# the best model selected by AIC criterion
## The X8 and X13 are falsely selected, and X1~4 are correctly selection
(sumleaps$which[which.min(AIC), -1])
## X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12 X13
## TRUE TRUE TRUE TRUE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE TRUE
## X14 X15 X16 X17 X18 X19 X20
## FALSE FALSE FALSE FALSE FALSE FALSE FALSE
For this particular dataset, our approach did not select the correct
model. Variable X_5
is selected and variable
X_4
is not selected. This is because the AIC criterion is
not perfect.
Answer:
nsim = 100
n = 100
p = 20
selection = matrix(NA, nsim, p)
correct = rep(NA, nsim)
for (i in 1:nsim)
{
# the data generator
x = matrix(rnorm(n*p), n, p)
y = 0.5 + x[, 1] + 0.5*x[, 2] + 0.25*x[, 3] + 0.125*x[, 4] + rnorm(n)
# Best subset selection
best_subsets <- regsubsets(x, y, nvmax = p)
sumleaps <- summary(best_subsets, matrix = T)
# Extract AIC for each model size
modelsize = apply(sumleaps$which,1,sum)
lm.fit <- lm(y ~ x)
AIC = n*log(sumleaps$rss/n) + 2*modelsize
# check if the model is correct
selection[i, ] = sumleaps$which[which.min(AIC), -1]
correct[i] = all(selection[i, ] == c(1, 1, 1, 1, rep(0, 16)))
}
mean(correct)
## [1] 0.01
plot(colMeans(selection), pch = 19, ylim = c(0, 1), cex = 2,
xlab = "Variable Index", ylab = "Proportion of Selection",
main = "Proportion of Selection for Each Variable",
col = c(rep("darkorange", 4), rep("blue", 16)))
The propotion of time that this appraoch selects the correct model is 0.01, which is quite low.
1) Discuss each of the settings or appraoches you have altered and explain why it can improve the selection rate.
2) If you use AI tools, discuss your experience with it. Such as how to write the prompt and whether you had to further modeify the code.
Answer:
Here is my code, without using AI tools. I changed from AIC to BIC because BIC has a better chance to select the true variables, while AIC concerns mostly the prediction accuracy. I further made two changes:
nsim = 100
n = 500
p = 20
selection = matrix(NA, nsim, p)
correct = rep(NA, nsim)
for (i in 1:nsim)
{
# the data generator
x = matrix(rnorm(n*p), n, p)
y = 0.5 + x[, 1] + 0.5*x[, 2] + 0.25*x[, 3] + 0.125*x[, 4] + rnorm(n)
# Best subset selection
best_subsets <- regsubsets(x, y, nvmax = p)
sumleaps <- summary(best_subsets, matrix = T)
# Extract AIC for each model size
modelsize = apply(sumleaps$which,1,sum)
lm.fit <- lm(y ~ x)
# I changed the AIC to BIC
BIC = n*log(sumleaps$rss/n) + modelsize*log(n);
# AIC = n*log(sumleaps$rss/n) + 2*modelsize
# check if the model is correct
selection[i, ] = sumleaps$which[which.min(BIC), -1]
correct[i] = all(selection[i, ] == c(1, 1, 1, 1, rep(0, 16)))
}
mean(correct)
## [1] 0.58
plot(colMeans(selection), pch = 19, ylim = c(0, 1), cex = 2,
xlab = "Variable Index", ylab = "Proportion of Selection",
main = "Proportion of Selection for Each Variable",
col = c(rep("darkorange", 4), rep("blue", 16)))
Overall, this boosts the selection rate to 0.58.
Alternatively, I tested the capability of using GPT-4. Here is the prompt
I have this code: “…” (I attached our class note code chunk). I want to adapt it to a version that uses best subset selection with AIC, using the leaps package. Can you help me with this?
Basically it suggested some nonsense. And ultimately, I cannot use GPT-4 to produce a code that do exactly what I wanted. But here are the steps I followed to obtain a basic structure:
And up to here, it basically gave me the code structure I wanted. Then I went on modified the code slightly which works pretty much the same as my previous one. Hence the experience was not great, but this is only because there is no example code in the training process that can be used to suggest such a complicated task. With a step-by-step approach, you can still utilize its capability of generating code.
We have introduced the formula of a linear regression
\[\widehat{\boldsymbol \beta} = (\mathbf{X}^\text{T} \mathbf{X})^{-1}\mathbf{X}^\text{T} \mathbf{y}\]
Let’s use the realestate
data as an example. The data
can be obtained from our course website. Here, \(\mathbf{X}\) is the design matrix with 414
observations and 4 columns: a column of 1 as the intercept, and
age
, distance
and stores
. \(\mathbf{y}\) is the outcome vector of
price
.
R
code to properly define both
\(\mathbf{X}\) and \(\mathbf{y}\), and then perform the linear
regression using the above formula. You cannot use lm()
for
this step. Report your \(\hat \beta\).
After getting your answer, compare that with the fitted coefficients
from the lm()
function.Answer:
# read in and define X and y
realestate = read.csv("realestate.csv", row.names = 1)
y = realestate$price
X = data.matrix(cbind(1, realestate[, c("age", "distance", "stores")]))
# solve linear regression
b = solve(t(X) %*% X) %*% t(X) %*% y
print(b)
## [,1]
## 1 42.97728621
## age -0.25285583
## distance -0.00537913
## stores 1.29744248
# to compare that with the fitted coefficients from the lm() function
lm(price ~ age + distance + stores, data = realestate)
##
## Call:
## lm(formula = price ~ age + distance + stores, data = realestate)
##
## Coefficients:
## (Intercept) age distance stores
## 42.977286 -0.252856 -0.005379 1.297442
The coefficients are the same.
price
with age
, distance
and
stores
), and then calculate the prediction error on the
testing data. Report your (mean) training error and testing (prediction)
error: \[\begin{align}
\text{Training Error} =& \frac{1}{n_\text{train}} \sum_{i \in
\text{Train}} (y_i - \hat y_i)^2 \\
\text{Testing Error} =& \frac{1}{n_\text{test}} \sum_{i \in
\text{Test}} (y_i - \hat y_i)^2
\end{align}\] Here \(y_i\) is
the original \(y\) value and \(\hat y_i\) is the fitted (for training
data) or predicted (for testing data) value. Which one do you expect to
be larger, and why? After carrying out your analysis, does the result
matches your expectation? If not, what could be the causes? # generate the indices for the testing data
set.seed(432)
test_idx = sample(nrow(realestate), 100)
Answer:
# define training and testing data
y = realestate$price
X = data.matrix(cbind(1, realestate[, c("age", "distance", "stores")]))
set.seed(432)
test_idx = sample(nrow(realestate), 100)
trainx = X[ -test_idx, ]
testx = X[ test_idx, ]
trainy = y[ -test_idx ]
testy = y[ test_idx ]
# fit linear regression using my own code
b = solve(t(trainx) %*% trainx) %*% t(trainx) %*% trainy
# predicting on the testing data
pred = testx %*% b
# training error
mean((trainy - trainx %*% b)^2)
## [1] 74.57346
# testing error
mean((testy - pred)^2)
## [1] 119.4458
# the overall variance of y
var(y)
## [1] 185.1365
The testing error is slightly higher. This is expected since the training error can always be affected by over-fitting.
step()
function (using all
covariates). Choose one among the AIC/BIC/Cp criterion to select the
best model. For the step()
function, you can use any
configuration you like, such as direction
etc. You should
still use the same trainning and testing ids defined previously. Report
your best model, trainning error and testing error.Answer:
realestate = read.csv("realestate.csv", row.names = 1)
train = realestate[-test_idx, ]
test = realestate[test_idx, ]
lm_fit = lm(price ~ ., data = train)
lm_fit
##
## Call:
## lm(formula = price ~ ., data = train)
##
## Coefficients:
## (Intercept) date age distance stores latitude
## -1.118e+04 4.438e+00 -3.098e-01 -4.889e-03 9.931e-01 2.140e+02
## longitude
## -2.509e+01
# perform step wise regression using AIC
step(lm_fit, direction = "both", trace = 0)
##
## Call:
## lm(formula = price ~ date + age + distance + stores + latitude,
## data = train)
##
## Coefficients:
## (Intercept) date age distance stores latitude
## -1.439e+04 4.467e+00 -3.098e-01 -4.613e-03 9.964e-01 2.178e+02
# the best model
lm_fit = lm(price ~ date + age + distance + stores + latitude, data = train)
# prediction error
pred = predict(lm_fit, newdata = test)
pred_err = mean((pred - test$price)^2)
print(pred_err)
## [1] 106.2898
# training error
pred = predict(lm_fit, newdata = train)
pred_err = mean((pred - train$price)^2)
print(pred_err)
## [1] 68.52164
This model is slightly better than the previous one.
Consider minimizing the following function:
\[f(x) = \exp(x) - 2.5 \times (x + 6)^2 - 0.05 \times x^3\]
f_obj(x)
that calculates
this objective function. Plot this function on the domain \(x \in [-40, 7]\).Answer:
f_obj <- function(x) exp(x) - 2.5*(x+6)^2 - 0.05*x^3
x = seq(-40, 7, 0.01)
plot(x, f_obj(x), type = "l", col = "darkorange", lwd = 2, ylab = "f(x)")
optim()
function to solve this
optimization problem. Use method = "BFGS"
. Try two initial
points: -15 and 0. Report Are the solutions you obtained different?
Why?Answer:
# perform BFGS with different intializations
sol1 = optim(-15, fn = f_obj, method = "BFGS")
print(c(sol1$par, sol1$value))
## [1] -25.48584 -121.55652
sol2 = optim(0, fn = f_obj, method = "BFGS")
print(c(sol2$par, sol2$value))
## [1] 3.953841 -198.652634
The two solutions are different. One is at -25.4858375, and other one at 3.9538407. This is because BFGS method only converged to a local minimum. Only the latter one corresponds to a global minimum.