Instruction

Please remove this section when submitting your homework.

Students are encouraged to work together on homework and/or utilize advanced AI tools. However, sharing, copying, or providing any part of a homework solution or code to others is an infraction of the University’s rules on Academic Integrity. Any violation will be punished as severely as possible. Final submissions must be uploaded to Gradescope. No email or hard copy will be accepted. For late submission policy and grading rubrics, please refer to the course website.

Question 1 (Continuing the Simulation Study)

During our lecture, we considered a simulation study using the following data generator:

\[Y = \sum_{j = 1}^p X_j 0.4^\sqrt{j} + \epsilon\]

And we added covariates one by one (in their numerical order, which is also the size of their effect) to observe the change of training error and testing error. However, in practice, we would not know the order of the variables. Hence several model selection tools were introduced. In this question, we will use similar data generators, with several nonzero effects, but use different model selection tools to find the best model. The goal is to understand the performence of model selection tools under various scenarios. Let’s first consider the following data generator:

\[Y = 0.5 + X_1 + 0.5 \cdot X_2 + 0.25 \cdot X_3 + 0.125 \cdot X_4 + \epsilon\]

where \(\epsilon \sim N(0, 1)\) and \(X_j \sim N(0, 1)\) for \(j = 1, \ldots, p\). Write your code the complete the following tasks:

  1. [10 points] Genereate one dataset, with sample size \(n = 100\) and dimension \(p = 20\) as our lecture note. Perform best subset selection (with the leaps package) and use the AIC criterion to select the best model. Report the best model and its prediction error. Does the appraoch selects the correct model, meaning that all the nonzero coefficient variables are selected and all the zero coefficient variables are removed? Which variable(s) was falsely selected and which variable(s) was falsely removed? Do not consider the intercept term, since they are always included in the model.

Answer:

  # Load the required library
  library(leaps)
  
  # set seed
  set.seed(1)

  # generate data using the given data generator
  n = 100
  p = 20
  x = matrix(rnorm(n*p), n, p)
  colnames(x) <- paste0("X", 1:p)
  y = 0.5 + x[, 1] + 0.5*x[, 2] + 0.25*x[, 3] + 0.125*x[, 4] + rnorm(n)

  # Best subset selection
  best_subsets <- regsubsets(x, y, nvmax = p)
  sumleaps <- summary(best_subsets, matrix = T)

  # Extract AIC for each model size
  modelsize = apply(sumleaps$which,1,sum)

  lm.fit <- lm(y ~ x)

  AIC = n*log(sumleaps$rss/n) + 2*modelsize

  # Plot AIC
  plot(modelsize, AIC, xlab = "Model Size", ylab = "AIC", type = "b", col = "darkorange", lwd = 3)

  # check if the model is correct
  all(sumleaps$which[which.min(AIC), -1] == c(1, 1, 1, 1, rep(0, 16)))
## [1] FALSE
  # the best model selected by AIC criterion
  ## The X8 and X13 are falsely selected, and X1~4 are correctly selection
  (sumleaps$which[which.min(AIC), -1])  
##    X1    X2    X3    X4    X5    X6    X7    X8    X9   X10   X11   X12   X13 
##  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE 
##   X14   X15   X16   X17   X18   X19   X20 
## FALSE FALSE FALSE FALSE FALSE FALSE FALSE

For this particular dataset, our approach did not select the correct model. Variable X_5 is selected and variable X_4 is not selected. This is because the AIC criterion is not perfect.

  1. [15 points] Repeat the previous step with 100 runs of simulation, similar to our lecture note. Report 1) the propotion of times that this appraoch selects the correct model 2) the propotion of times that each variabe was selected

Answer:

  nsim = 100
  n = 100
  p = 20

  selection = matrix(NA, nsim, p)
  correct = rep(NA, nsim)

  for (i in 1:nsim)
  {
    # the data generator
    x = matrix(rnorm(n*p), n, p)
    y =  0.5 + x[, 1] + 0.5*x[, 2] + 0.25*x[, 3] + 0.125*x[, 4] + rnorm(n)
    
    # Best subset selection
    best_subsets <- regsubsets(x, y, nvmax = p)
    sumleaps <- summary(best_subsets, matrix = T)

    # Extract AIC for each model size
    modelsize = apply(sumleaps$which,1,sum)
    lm.fit <- lm(y ~ x)
    AIC = n*log(sumleaps$rss/n) + 2*modelsize

    # check if the model is correct
    selection[i, ] = sumleaps$which[which.min(AIC), -1]
    correct[i] = all(selection[i, ] == c(1, 1, 1, 1, rep(0, 16)))
  }

  mean(correct)
## [1] 0.01
  plot(colMeans(selection), pch = 19, ylim = c(0, 1), cex = 2,
  xlab = "Variable Index", ylab = "Proportion of Selection", 
  main = "Proportion of Selection for Each Variable", 
  col = c(rep("darkorange", 4), rep("blue", 16)))

The propotion of time that this appraoch selects the correct model is 0.01, which is quite low.

  1. [20 points] In the previous question, you should be able to observe that the propotion of times that this appraoch selects the correct model is relatively low. This could be due to many reasons. Can you suggest some situations (setting of the model) or appraoches (your model fitting procedure) for which the chance will be much improved (consider using AI tools if needed)? Implement that idea and varify the new selection rate and compare with the previous result. Furthermore,
1) Discuss each of the settings or appraoches you have altered and explain why it can improve the selection rate.
2) If you use AI tools, discuss your experience with it. Such as how to write the prompt and whether you had to further modeify the code.

Answer:

Here is my code, without using AI tools. I changed from AIC to BIC because BIC has a better chance to select the true variables, while AIC concerns mostly the prediction accuracy. I further made two changes:

  nsim = 100
  n = 500
  p = 20

  selection = matrix(NA, nsim, p)
  correct = rep(NA, nsim)

  for (i in 1:nsim)
  {
    # the data generator
    x = matrix(rnorm(n*p), n, p)
    y =  0.5 + x[, 1] + 0.5*x[, 2] + 0.25*x[, 3] + 0.125*x[, 4] + rnorm(n)
    
    # Best subset selection
    best_subsets <- regsubsets(x, y, nvmax = p)
    sumleaps <- summary(best_subsets, matrix = T)

    # Extract AIC for each model size
    modelsize = apply(sumleaps$which,1,sum)
    lm.fit <- lm(y ~ x)

    # I changed the AIC to BIC
    BIC = n*log(sumleaps$rss/n) + modelsize*log(n);
    # AIC = n*log(sumleaps$rss/n) + 2*modelsize

    # check if the model is correct
    selection[i, ] = sumleaps$which[which.min(BIC), -1]
    correct[i] = all(selection[i, ] == c(1, 1, 1, 1, rep(0, 16)))
  }

  mean(correct)
## [1] 0.58
  plot(colMeans(selection), pch = 19, ylim = c(0, 1), cex = 2,
  xlab = "Variable Index", ylab = "Proportion of Selection", 
  main = "Proportion of Selection for Each Variable", 
  col = c(rep("darkorange", 4), rep("blue", 16)))

Overall, this boosts the selection rate to 0.58.

Alternatively, I tested the capability of using GPT-4. Here is the prompt

I have this code: “…” (I attached our class note code chunk). I want to adapt it to a version that uses best subset selection with AIC, using the leaps package. Can you help me with this?

Basically it suggested some nonsense. And ultimately, I cannot use GPT-4 to produce a code that do exactly what I wanted. But here are the steps I followed to obtain a basic structure:

And up to here, it basically gave me the code structure I wanted. Then I went on modified the code slightly which works pretty much the same as my previous one. Hence the experience was not great, but this is only because there is no example code in the training process that can be used to suggest such a complicated task. With a step-by-step approach, you can still utilize its capability of generating code.

Question 2 (Training and Testing of Linear Regression)

We have introduced the formula of a linear regression

\[\widehat{\boldsymbol \beta} = (\mathbf{X}^\text{T} \mathbf{X})^{-1}\mathbf{X}^\text{T} \mathbf{y}\]

Let’s use the realestate data as an example. The data can be obtained from our course website. Here, \(\mathbf{X}\) is the design matrix with 414 observations and 4 columns: a column of 1 as the intercept, and age, distance and stores. \(\mathbf{y}\) is the outcome vector of price.

  1. [15 points ] Write an R code to properly define both \(\mathbf{X}\) and \(\mathbf{y}\), and then perform the linear regression using the above formula. You cannot use lm() for this step. Report your \(\hat \beta\). After getting your answer, compare that with the fitted coefficients from the lm() function.

Answer:

  # read in and define X and y
  realestate = read.csv("realestate.csv", row.names = 1)
  y = realestate$price
  X = data.matrix(cbind(1, realestate[, c("age", "distance", "stores")]))

  # solve linear regression
  b = solve(t(X) %*% X) %*% t(X) %*% y
  print(b)
##                 [,1]
## 1        42.97728621
## age      -0.25285583
## distance -0.00537913
## stores    1.29744248
  # to compare that with the fitted coefficients from the lm() function
  lm(price ~ age + distance + stores, data = realestate)
## 
## Call:
## lm(formula = price ~ age + distance + stores, data = realestate)
## 
## Coefficients:
## (Intercept)          age     distance       stores  
##   42.977286    -0.252856    -0.005379     1.297442

The coefficients are the same.

  1. [10 pts] Split your data into two parts: a testing data that contains 100 observations, and the rest as training data. Use the following code to generate the ids for the testing data. Use your previous code to fit a linear regression model (predict price with age, distance and stores), and then calculate the prediction error on the testing data. Report your (mean) training error and testing (prediction) error: \[\begin{align} \text{Training Error} =& \frac{1}{n_\text{train}} \sum_{i \in \text{Train}} (y_i - \hat y_i)^2 \\ \text{Testing Error} =& \frac{1}{n_\text{test}} \sum_{i \in \text{Test}} (y_i - \hat y_i)^2 \end{align}\] Here \(y_i\) is the original \(y\) value and \(\hat y_i\) is the fitted (for training data) or predicted (for testing data) value. Which one do you expect to be larger, and why? After carrying out your analysis, does the result matches your expectation? If not, what could be the causes?
  # generate the indices for the testing data
  set.seed(432)
  test_idx = sample(nrow(realestate), 100)

Answer:

  # define training and testing data 
  y = realestate$price
  X = data.matrix(cbind(1, realestate[, c("age", "distance", "stores")]))

  set.seed(432)
  test_idx = sample(nrow(realestate), 100)

  trainx = X[ -test_idx, ]
  testx  = X[ test_idx, ]
  trainy = y[ -test_idx ]
  testy  = y[ test_idx ]

  # fit linear regression using my own code 
  b = solve(t(trainx) %*% trainx) %*% t(trainx) %*% trainy

  # predicting on the testing data 
  pred = testx %*% b

  # training error
  mean((trainy - trainx %*% b)^2)
## [1] 74.57346
  # testing error
  mean((testy - pred)^2)
## [1] 119.4458
  # the overall variance of y
  var(y)
## [1] 185.1365

The testing error is slightly higher. This is expected since the training error can always be affected by over-fitting.

  1. [10 points] Alternatively, you can always use built-in functions to fit linear regression. Setup your code to perform a step-wise linear regression using the step() function (using all covariates). Choose one among the AIC/BIC/Cp criterion to select the best model. For the step() function, you can use any configuration you like, such as direction etc. You should still use the same trainning and testing ids defined previously. Report your best model, trainning error and testing error.

Answer:

  realestate = read.csv("realestate.csv", row.names = 1)
  train = realestate[-test_idx, ]
  test  = realestate[test_idx, ]
  
  lm_fit = lm(price ~ ., data = train)
  lm_fit
## 
## Call:
## lm(formula = price ~ ., data = train)
## 
## Coefficients:
## (Intercept)         date          age     distance       stores     latitude  
##  -1.118e+04    4.438e+00   -3.098e-01   -4.889e-03    9.931e-01    2.140e+02  
##   longitude  
##  -2.509e+01
  # perform step wise regression using AIC
  step(lm_fit, direction = "both", trace = 0)
## 
## Call:
## lm(formula = price ~ date + age + distance + stores + latitude, 
##     data = train)
## 
## Coefficients:
## (Intercept)         date          age     distance       stores     latitude  
##  -1.439e+04    4.467e+00   -3.098e-01   -4.613e-03    9.964e-01    2.178e+02
  # the best model
  lm_fit = lm(price ~ date + age + distance + stores + latitude, data = train)

  # prediction error
  pred = predict(lm_fit, newdata = test)
  pred_err = mean((pred - test$price)^2)
  print(pred_err)
## [1] 106.2898
  # training error
  pred = predict(lm_fit, newdata = train)
  pred_err = mean((pred - train$price)^2)
  print(pred_err)
## [1] 68.52164

This model is slightly better than the previous one.

Question 3 (Optimizing a Univariate Function)

Consider minimizing the following function:

\[f(x) = \exp(x) - 2.5 \times (x + 6)^2 - 0.05 \times x^3\]

  1. [10 Points] Write a function f_obj(x) that calculates this objective function. Plot this function on the domain \(x \in [-40, 7]\).

Answer:

  f_obj <- function(x) exp(x) - 2.5*(x+6)^2 - 0.05*x^3
  x = seq(-40, 7, 0.01)
  plot(x, f_obj(x), type = "l", col = "darkorange", lwd = 2, ylab = "f(x)")

  1. [10 Points] Use the optim() function to solve this optimization problem. Use method = "BFGS". Try two initial points: -15 and 0. Report Are the solutions you obtained different? Why?

Answer:

  # perform BFGS with different intializations
  sol1 = optim(-15, fn = f_obj, method = "BFGS")
  print(c(sol1$par, sol1$value))
## [1]  -25.48584 -121.55652
  sol2 = optim(0, fn = f_obj, method = "BFGS")
  print(c(sol2$par, sol2$value))
## [1]    3.953841 -198.652634

The two solutions are different. One is at -25.4858375, and other one at 3.9538407. This is because BFGS method only converged to a local minimum. Only the latter one corresponds to a global minimum.