Instruction

Please remove this section when submitting your homework.

Students are encouraged to work together on homework and/or utilize advanced AI tools. However, there are two basic rules:

Final submissions must be uploaded to Gradescope. No email or hard copy will be accepted. Please refer to the course website for the late submission policy and grading rubrics.

Question 1 (Multivariate Normal Distribution)

This question is about playing with AI tools for generating multivariate normal random variables. Let \(X_i\), \(i = 1, \ldots, n\) be i.i.d. multivariate normal random variables with mean \(\mu\) and covariance matrix \(\Sigma\), where

\[ \mu = \begin{bmatrix} 1 \\ 2 \end{bmatrix}, \quad \text{and} \quad \Sigma = \begin{bmatrix} 1 & 0.5 \\ 0.5 & 1 \end{bmatrix}. \]

Write R code to perform the following tasks. Please try to use AI tools as much as possible in this question.

  1. [10 points] Generate a set of \(n = 2000\) observations from this distribution. Only display the first 5 observations in your R output. Make sure to set the random seed to 1 so that the result can be replicated. Calculate the sample covariance matrix of the generated data and compare it with the true covariance matrix \(\Sigma\).

  2. [10 points] If you used any AI tools to answer the previous part, they will most likely have suggested the mvrnorm function from the MASS package. However, there are alternative ways to complete the task. For example, you could first generate standard normal random variables and then transform them to the desired distribution. Write down the mathematical formula of this approach in LaTeX, and then write the corresponding R code to implement it. Only display the first 5 observations in your R output. Validate your approach by computing the sample covariance matrix of the generated data and comparing it with the true covariance matrix \(\Sigma\). Please note that you should not use the mvrnorm function in this part.

  3. [10 points] Write an R function called mymvnorm that takes the following arguments: n, mu, sigma. The function should return a matrix of dimension \(n \times p\), where \(p\) is the length of mu. The function should generate \(n\) observations from a multivariate normal distribution with mean mu and covariance matrix sigma. You should not use the mvrnorm function or any other similar built-in R functions in your code. Instead, use the logic you wrote in part 2 to generate the data. Again, validate your result by calculating the sample covariance matrix of the generated data and comparing it with \(\Sigma\). Also, if the seed is set correctly, your result in this part should be identical to the one in part 2.

  4. [5 points] Briefly comment on your usage of AI tools in the above questions.

    • Did you use any AI tools? If so, which ones and to what extent? Make sure to include this information at the beginning of your future homework as well.
    • Did your AI tool(s) immediately provide the answers you needed? If not, briefly explain how you modified the prompt or the code to get the desired result. If you did not use any AI tools, please state that explicitly.
  5. [5 points] Try to create a question related to the multivariate normal distribution that you think the AI is going to have difficulty answering. Write down the question, the answer you expected, and the answer you got from the AI. Were you able to trick the AI? If not, briefly discuss your experience.
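
As a point of reference for the "generate standard normals, then transform" idea mentioned in part 2, here is a minimal sketch of one common way to do it, using a Cholesky factor of the covariance matrix. This is only one possible approach (an eigendecomposition would also work), and the function name rmvn_chol is a hypothetical helper, not the mymvnorm function the question asks for.

```r
# Sketch: draw n samples from N(mu, sigma) via a Cholesky factor.
# `rmvn_chol` is a hypothetical helper name used for illustration only.
rmvn_chol <- function(n, mu, sigma) {
  p <- length(mu)
  # chol() returns an upper-triangular R with t(R) %*% R == sigma
  R <- chol(sigma)
  # n x p matrix of independent standard normal draws
  Z <- matrix(rnorm(n * p), nrow = n, ncol = p)
  # each row z' becomes z' R + mu', so its covariance is t(R) %*% R = sigma
  sweep(Z %*% R, 2, mu, "+")
}

set.seed(1)
mu <- c(1, 2)
Sigma <- matrix(c(1, 0.5, 0.5, 1), nrow = 2)
X <- rmvn_chol(2000, mu, Sigma)
head(X, 5)   # first 5 observations
cov(X)       # should be close to Sigma, but not exactly equal
```

The sample covariance will differ from \(\Sigma\) by sampling error, and the draws will generally not match those of mvrnorm under the same seed, since the two use the random numbers differently.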

Question 2 (Data Manipulation and Plots)

The following question practices data manipulation, visualization and linear regression. Load the quantmod package and obtain the AAPL data (Apple stock price).

  library(quantmod)
  getSymbols("AAPL")
## [1] "AAPL"
  plot(AAPL$AAPL.Close, pch = 19)

  1. [15 points] Calculate a 10-day moving average of the closing price of AAPL and plot it on the same graph. Here, moving average means that for each day you take the average of the past 10 days (including the current day). Please do this in two ways: 1) use the built-in SMA function, which comes from the TTR package that is loaded together with quantmod; 2) write your own function to calculate the moving average. Plot both results and check whether the two calculations are identical. For both approaches, you can utilize AI tools to help you write the code.

  2. [15 points] Let’s fit a simple linear regression that predicts the average closing price of AAPL over the next five days (not including the current day) using two variables: the average of the past 10 days, and the average of the past 20 days. Provide summaries of the regression results and comment on whether the information beyond the past 10 days is useful.

  3. [10 points] This model fitting is too simple. What are the potential issues of this model that could make the results unreliable? Briefly discuss two of your findings, and search the literature to provide a reference that supports them.
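
To make the moving-average definition in part 1 concrete, here is a minimal sketch of a trailing window average on a plain numeric vector. The name my_sma and the synthetic price series are assumptions made for illustration; the sketch does not download or use the AAPL data.

```r
# Sketch: trailing k-day moving average (current day included).
# `my_sma` is a hypothetical name; the first k - 1 entries are NA
# because a full window of k days is not yet available.
my_sma <- function(x, k = 10) {
  n <- length(x)
  out <- rep(NA_real_, n)
  if (n < k) return(out)
  # cs[i + 1] holds the sum of x[1..i], so a window sum is a difference
  cs <- cumsum(c(0, x))
  out[k:n] <- (cs[(k + 1):(n + 1)] - cs[1:(n - k + 1)]) / k
  out
}

set.seed(1)
price <- 100 + cumsum(rnorm(50))   # synthetic prices, not AAPL data
ma10 <- my_sma(price, k = 10)

# Spot check against the definition for day 10
all.equal(ma10[10], mean(price[1:10]))
```

When comparing a hand-written version with SMA, a tolerance-based comparison such as all.equal is usually more appropriate than identical, since two numerically equivalent calculations can differ in the last few floating-point digits.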

Question 3 (Read/write Data)

  1. [10 points] The ElemStatLearn package [CRAN link] is an archived package. Hence, you cannot directly install it using the install.packages() function. Instead, you may install an older version of it by using the install_github() function from the devtools package. Install the devtools package, then find and run the code to install the ElemStatLearn package.

  2. [10 points] Load the ElemStatLearn package and obtain the ozone data. Save this data into a .csv file, and then read the data back from that file into R. Print out the first 5 observations to make sure that the new data is the same as the original one.
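
The write-then-read round trip in part 2 can be sketched as follows. Because ElemStatLearn may not be installed, this sketch assumes the built-in airquality data frame as a stand-in for the ozone data and writes to a temporary file; both choices are illustrative only.

```r
# Sketch: save a data frame to .csv and read it back.
# `airquality` (from base R's datasets package) is a stand-in here,
# NOT the ozone data from ElemStatLearn.
original <- airquality

# tempfile() avoids cluttering the working directory
csv_path <- tempfile(fileext = ".csv")

# row.names = FALSE keeps row labels from becoming an extra column
write.csv(original, csv_path, row.names = FALSE)
restored <- read.csv(csv_path)

head(original, 5)
head(restored, 5)

# Column names and dimensions should survive the round trip
identical(names(original), names(restored))
identical(dim(original), dim(restored))
```

Omitting row.names = FALSE is a frequent cause of mismatches, since the row labels are then written as an unnamed first column and read back as an additional variable.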