Instruction

Please remove this section when submitting your homework.

Students are encouraged to work together on homework and/or utilize advanced AI tools. However, there are two basic rules:

Final submissions must be uploaded to Gradescope. No email or hard copy will be accepted. Please refer to the course website for late submission policy and grading rubrics.

Question 1 (Multivariate Normal Distribution)

This question is about playing with AI tools for generating multivariate normal random variables. Let \(X_i\), \(i = 1, \ldots, n\) be i.i.d. multivariate normal random variables with mean \(\mu\) and covariance matrix \(\Sigma\), where

\[ \mu = \begin{bmatrix} 1 \\ 2 \end{bmatrix}, \quad \text{and} \quad \Sigma = \begin{bmatrix} 1 & 0.5 \\ 0.5 & 1 \end{bmatrix}. \] Write R code to perform the following tasks. Please try to use AI tools as much as possible in this question.

  a. [10 points] Generate a set of \(n = 2000\) observations from this distribution. Only display the first 5 observations in your R output. Make sure to set the random seed \(= 1\) in order to replicate the result. Calculate the sample covariance matrix of the generated data and compare it with the true covariance matrix \(\Sigma\).
  set.seed(1)
  n <- 2000
  mu <- c(1, 2)
  sigma <- matrix(c(1, 0.5, 0.5, 1), nrow = 2, ncol = 2)
  X <- MASS::mvrnorm(n, mu, sigma)
  head(X, 5)
##            [,1]     [,2]
## [1,]  0.9005499 1.014400
## [2,]  2.1201672 1.197912
## [3,] -0.5335260 2.086175
## [4,]  2.1219187 3.641189
## [5,]  1.3132871 2.257437
  cov(X)
##           [,1]      [,2]
## [1,] 1.0443799 0.5392157
## [2,] 0.5392157 1.1045078
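To make the comparison explicit, one could also look at the element-wise difference between the sample covariance and the true \(\Sigma\) (a quick sketch; output omitted, but the entries should all be close to zero):

  # element-wise difference between the sample covariance and the true Sigma
  cov(X) - sigma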
  b. [10 points] If you used any AI tools to perform the previous part, they will most likely suggest using the mvrnorm function from the MASS package. However, there are alternative ways to complete this question. For example, you could first generate \(n\) standard normal random variables, and then transform them to the desired distribution. Write down the mathematical formula of this approach in LaTeX, and then write the corresponding R code to implement this approach. Only display the first 5 observations in your R output. Validate your approach by computing the sample covariance matrix of the generated data and comparing it with the true covariance matrix \(\Sigma\). Please note that you should not use the mvrnorm function anymore in this question.
  set.seed(1)
  Z <- matrix(rnorm(n*2), n, 2)
  X <- sweep(Z %*% chol(sigma), 2, mu, "+")
  head(X, 5)
##           [,1]      [,2]
## [1,] 0.3735462 0.9193450
## [2,] 1.1836433 0.4271001
## [3,] 0.1643714 2.9848877
## [4,] 2.5952808 3.2473413
## [5,] 1.3295078 2.1163864
  cov(X)
##           [,1]      [,2]
## [1,] 1.0757730 0.5679505
## [2,] 0.5679505 1.1018494

The logic of this approach can be explained by the following:

\[\begin{align*}
\textbf{Given}: & \quad \mathbf{Z} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}) \\
\textbf{Cholesky Decomposition}: & \quad \mathbf{L} = \text{chol}(\mathbf{\Sigma}), \,\, \text{i.e.,} \,\, \mathbf{L}\mathbf{L}^T = \mathbf{\Sigma} \\
\textbf{Transformation}: & \quad \mathbf{X} = \mathbf{L}\mathbf{Z} + \mathbf{\mu} \\
\textbf{Mean of } \mathbf{X}: & \quad \mathbb{E}[\mathbf{X}] = \mathbb{E}[\mathbf{L}\mathbf{Z}] + \mathbf{\mu} = \mathbf{L}\mathbf{0} + \mathbf{\mu} = \mathbf{\mu} \\
\textbf{Covariance of } \mathbf{X}: & \quad \text{Cov}(\mathbf{X}) = \mathbf{L}\,\text{Cov}(\mathbf{Z})\,\mathbf{L}^T = \mathbf{L}\mathbf{I}\mathbf{L}^T = \mathbf{\Sigma}
\end{align*}\]

Since linear combinations of independent normal random variables are still jointly normal, \(\mathbf{X} \sim \mathcal{N}(\mathbf{\mu}, \mathbf{\Sigma})\). One detail worth noting is that R's chol() function returns the upper-triangular factor \(\mathbf{U}\) with \(\mathbf{U}^T\mathbf{U} = \mathbf{\Sigma}\), i.e., \(\mathbf{U} = \mathbf{L}^T\); since the observations are stored as rows, the code computes \(\mathbf{X} = \mathbf{Z}\mathbf{U} + \mathbf{\mu}\) row by row, which is the same transformation.
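A quick sanity check of the factorization (a sketch; output omitted):

  U <- chol(sigma)               # upper-triangular factor returned by chol()
  all.equal(t(U) %*% U, sigma)   # should be TRUE: U'U recovers Sigma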

  c. [10 points] Write an R function called mymvnorm that takes the following arguments: n, mu, sigma. The function should return a matrix of dimension \(n \times p\), where \(p\) is the length of mu. The function should generate \(n\) observations from a multivariate normal distribution with mean mu and covariance matrix sigma. You should not use the mvrnorm function or any other similar built-in R functions in your code. Instead, use the logic you wrote in part b) to generate the data. Again, validate your result by calculating the sample covariance matrix of the generated data and comparing it to \(\Sigma\). Also, when the seed is set correctly, your answer to this question should be identical to the one in part b).
  mymvnorm <- function(n, m, S) {
    # generate n observations from N(m, S) using the Cholesky factor of S
    p <- length(m)
    Z <- matrix(rnorm(n * p), n, p)          # n x p matrix of standard normals
    X <- sweep(Z %*% chol(S), 2, m, "+")     # transform and shift by the mean
    return(X)
  }

  set.seed(1)
  X <- mymvnorm(n, m = mu, S = sigma)
  head(X, 5)
##           [,1]      [,2]
## [1,] 0.3735462 0.9193450
## [2,] 1.1836433 0.4271001
## [3,] 0.1643714 2.9848877
## [4,] 2.5952808 3.2473413
## [5,] 1.3295078 2.1163864
  cov(X)
##           [,1]      [,2]
## [1,] 1.0757730 0.5679505
## [2,] 0.5679505 1.1018494
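As an optional check that the function works for any dimension \(p\), here is a sketch with a hypothetical 3-dimensional mean and covariance (output omitted):

  set.seed(2)
  mu3 <- c(0, 1, 2)
  sigma3 <- matrix(c(1.0, 0.3, 0.2,
                     0.3, 1.0, 0.4,
                     0.2, 0.4, 1.0), nrow = 3)
  X3 <- mymvnorm(500, m = mu3, S = sigma3)
  dim(X3)   # should be 500 x 3
  cov(X3)   # should be close to sigma3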
  d. [5 points] Briefly comment on your usage of AI tools in the above questions.

    • Did you use any AI tools? If so, which ones and to what extent? Make sure to include this information at the beginning of your future homework as well.
    • Did your AI tool(s) immediately provide the answers you needed? If not, briefly explain how you modified the prompt or the code to get the desired result. If you did not use any AI tools, please state that explicitly.
  e. [10 points] Try to create a question related to the multivariate normal distribution that you think the AI is going to have difficulty answering. Write down the question, the answer you expected, and the answer you got from the AI. Were you able to trick the AI? If not, briefly discuss your experience.

Answer:

I used Copilot to help me write the code, so I directly wrote the following sentence.

“The question I used is if \(X\) and \(Y\) are random variables, the mean of \(X\) is 1.9 and the mean of \(Y\) is 1.11, which variable has a larger mean.”

I then typed "The answer I got" on the next line, and Copilot immediately suggested the following answer.

"The answer I got from the AI is that it is not possible to determine which variable has a larger mean without additional information about the distributions of \(X\) and \(Y\)."

Another observation from my experience with Copilot in RStudio is that if there is wrong code early in your R script, it is more likely to suggest similarly wrong code later on. So I would suggest always checking the code carefully.

Question 2 (Data Manipulation and Plots)

The following question practices data manipulation, visualization, and linear regression. Load the quantmod package and obtain the AAPL data (Apple stock price).

  library(quantmod)
## Loading required package: xts
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## Loading required package: TTR
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo
  getSymbols("AAPL")
## [1] "AAPL"
  plot(AAPL$AAPL.Close, pch = 19)

  a. [10 points] Calculate a 10-day moving average of the closing price of AAPL and plot it on the same graph. Moving average means that for each day, you take the average of the past 10 days (including the current day). Please do this in two ways: 1) use the built-in function SMA from the quantmod package; 2) write your own function to calculate the moving average. Plot both and also check whether the two calculations are identical. For both approaches, you can utilize AI tools to help you write the code.
  AAPL$MA10 <- SMA(Cl(AAPL), 10)
  plot(AAPL$AAPL.Close, pch = 19)

  lines(AAPL$MA10, col = "red", lwd = 2)

  moving_average <- function(x, window) {
    # Compute the moving average of x with window size = window
    n <- length(x)
    ma <- rep(NA, n)
    for (i in window:n) {
      ma[i] <- mean(x[(i-window+1):i])
    }
    return(ma)
  }

  AAPL$MA10_2 <- moving_average(Cl(AAPL), 10)
  plot(AAPL$AAPL.Close, pch = 19)

  lines(AAPL$MA10_2, col = "red", lwd = 2)

  #check if the two are the same
  all.equal(AAPL$MA10, AAPL$MA10_2, check.attributes = FALSE)
## [1] TRUE
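A vectorized alternative to the loop-based function, using cumsum (a sketch; it should agree with the loop version up to floating-point rounding):

  moving_average_vec <- function(x, window) {
    x <- as.numeric(x)
    cs <- c(0, cumsum(x))                  # padded cumulative sums: cs[i + 1] = sum(x[1:i])
    ma <- rep(NA_real_, length(x))
    idx <- window:length(x)
    ma[idx] <- (cs[idx + 1] - cs[idx - window + 1]) / window
    return(ma)
  }

  # e.g., all.equal(as.numeric(AAPL$MA10_2), moving_average_vec(Cl(AAPL), 10))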
  b. [10 points] Let’s do a simple linear regression that predicts the average closing price of AAPL of the next five days (not including the current day) using two variables: the average of the past 10 days, and the average of the past 20 days. Provide summaries of the regression results and comment on whether the information beyond the past 10 days is useful.

Answer:

The approach below was suggested by Copilot but is wrong, because its response variable includes the current day in the 5-day average.

To do this, we first need to calculate the 20-day moving average and the lead 5-day average.

  AAPL$MA20 <- SMA(Cl(AAPL), 20)
  AAPL$Lead5 <- zoo::rollapply(Cl(AAPL), width = 5, FUN = mean, align = "left", fill = NA)
  
  # Remove rows with NA values
  AAPL_clean <- na.omit(AAPL[, c("Lead5", "MA10", "MA20")])
  
  # Fit the linear model
  model <- lm(Lead5 ~ MA10 + MA20, data = AAPL_clean)
  summary(model)
## 
## Call:
## lm(formula = Lead5 ~ MA10 + MA20, data = AAPL_clean)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -31.684  -0.600  -0.052   0.584  18.830 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.14380    0.06329   2.272   0.0231 *  
## MA10         1.39765    0.02274  61.463   <2e-16 ***
## MA20        -0.39645    0.02281 -17.382   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.17 on 4672 degrees of freedom
## Multiple R-squared:  0.9979, Adjusted R-squared:  0.9979 
## F-statistic: 1.126e+06 on 2 and 4672 DF,  p-value: < 2.2e-16
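One possible fix that stays with the rollapply-based approach (a sketch; it assumes xts's lag() convention, where a negative k shifts values forward so that each row receives the next day's value):

  # shift the left-aligned 5-day average by one day, so that each row holds the
  # average of days t+1, ..., t+5, excluding the current day
  AAPL$Lead5_fixed <- lag(AAPL$Lead5, k = -1)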

Here is my own code:

  moving_average_forward <- function(x, window) {
    # Compute the forward-looking average of the next `window` values of x,
    # excluding the current one
    n <- length(x)
    ma <- rep(NA, n)
    for (i in 1:(n - window)) {
      ma[i] <- mean(x[(i + 1):(i + window)])
    }
    return(ma)
  }

  AAPL$MA20_2 <- moving_average(Cl(AAPL), 20)
  AAPL$Lead5_2 <- moving_average_forward(Cl(AAPL), 5)
  
  # Fit the linear model
  model <- lm(Lead5_2 ~ MA10_2 + MA20_2, data = AAPL, na.action = na.omit)
  summary(model)  
## 
## Call:
## lm(formula = Lead5_2 ~ MA10_2 + MA20_2, data = AAPL, na.action = na.omit)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -34.330  -0.658  -0.064   0.651  20.523 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.17541    0.07087   2.475   0.0134 *  
## MA10_2       1.38555    0.02547  54.396   <2e-16 ***
## MA20_2      -0.38404    0.02555 -15.032   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.549 on 4671 degrees of freedom
##   (24 observations deleted due to missingness)
## Multiple R-squared:  0.9974, Adjusted R-squared:  0.9974 
## F-statistic: 8.978e+05 on 2 and 4671 DF,  p-value: < 2.2e-16

To determine whether information beyond 10 days is useful, compare to a model that does not use MA20_2:

  # Fit the linear model
  model2 <- lm(Lead5_2 ~ MA10_2, data = AAPL[!(is.na(AAPL$MA20_2)),], na.action = na.omit)
  AIC(model,model2) 

Information beyond 10 days seems useful based on the AIC.
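An equivalent way to judge whether MA20_2 adds information is a nested-model F-test (a sketch; both models are fitted on the same rows by construction, output omitted):

  # partial F-test comparing the model without MA20_2 against the full model
  anova(model2, model)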

  c. [10 points] This model fitting is too simple. What are the potential issues of this model that could make the results unreliable? Briefly discuss two of your findings and also search the literature and provide a reference to support your findings.

Examples of potential issues:

  1. Multicollinearity: MA10_2 and MA20_2 are highly correlated, which makes it difficult to determine the individual effect of each predictor on the dependent variable. The sign and the magnitude of the coefficient estimates may be misleading. (This can be checked by fitting a model with only MA20_2 as an explanatory variable – the sign of its coefficient flips.)
  2. Violation of OLS assumptions such as independence of residuals and homoscedasticity: OLS does not account for the autocorrelation or non-stationarity that may be present in time series data.

A quick numerical check of both points is sketched below.
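The following sketch uses base R and the objects MA10_2, MA20_2, and model defined above (output omitted):

  # correlation between the two predictors; a value near 1 indicates strong multicollinearity
  cor(as.numeric(AAPL$MA10_2), as.numeric(AAPL$MA20_2), use = "complete.obs")

  # autocorrelation of the residuals from the model fitted above
  acf(residuals(model))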

Question 3 (Read/write Data)

  a. [10 points] The ElemStatLearn package [CRAN link] is an archived package. Hence, you cannot directly install it using the install.packages() function. Instead, you may install an older version of it by using the install_github() function from the devtools package. Install the devtools package and find the code to install the ElemStatLearn package.
  library(devtools)
  devtools::install_github("cran/ElemStatLearn")
  
  # alternatively, you can install the last archived version directly
  devtools::install_version("ElemStatLearn", version = "2015.6.26.2",
                            repos = "http://cran.us.r-project.org")
  b. [15 Points] Load the ElemStatLearn package and obtain the ozone data. Save this data into a .csv file, and then read the data back from that file into R. Print out the first 5 observations to make sure that the new data is the same as the original one.

We first get the ozone data and save it in a .csv file.

  library(ElemStatLearn)
  data(ozone)
  write.csv(ozone, file = "ozone.csv")

We then load the data back into R:

  ozone_new = read.csv("ozone.csv")

However, the two data sets are not identical, because read.csv() treats the first column of the file (the row names written out by write.csv()) as a new variable.

  head(ozone_new)

To prevent this issue, we will specify the argument row.names = 1. This leads to the correct data.

  ozone_new = read.csv("ozone.csv", row.names = 1)
  head(ozone_new, 5)
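Alternatively, the issue can be avoided at write time by not saving the row names at all (a sketch that overwrites the same ozone.csv file; output omitted):

  write.csv(ozone, file = "ozone.csv", row.names = FALSE)
  ozone_new <- read.csv("ozone.csv")
  head(ozone_new, 5)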

For this question, my AI tool (Copilot) did not suggest the correct code, since it did not realize that read.csv() would also read the row names back in as a new variable. Hence, after doing some research, I found that adding row.names = 1 solves the problem. Below is the code originally suggested by Copilot.

  library(ElemStatLearn)
  data(ozone)
  write.csv(ozone, "ozone.csv")
  ozone2 <- read.csv("ozone.csv")
  head(ozone2, 5)