Please remove this section when submitting your homework.

Students are encouraged to work together on homework and/or utilize advanced AI tools. However, there are two basic rules:

Final submissions must be uploaded to Gradescope. No email or hard copy will be accepted. Please refer to the course website for the late submission policy and grading rubrics. Name your file `HWx_yourNetID.pdf`, for example, `HW01_rqzhu.pdf`. Please note that this must be a `.pdf` file; `.html` files will not be accepted because they are often not readable on Gradescope. Make all of your R code chunks visible for grading. Make sure your version of R is \(\geq 4.0.0\); this will ensure your random seed generation is the same as everyone else's. If you use this `.Rmd` file as a template, be sure to remove this instruction section.

This question is about playing with AI tools for generating multivariate normal random variables. Let \(X_i\), \(i = 1, \ldots, n\) be i.i.d. multivariate normal random variables with mean \(\mu\) and covariance matrix \(\Sigma\), where
\[
\mu = \begin{bmatrix} 1 \\ 2 \end{bmatrix}, \quad \text{and} \quad
\Sigma = \begin{bmatrix} 1 & 0.5 \\ 0.5 & 1 \end{bmatrix}.
\] Write R code to perform the following tasks. Please try to use AI tools as much as possible in this question.

Generate \(n = 2000\) observations from this distribution, but only display the first 5 observations in your R output. Make sure to set the random seed \(= 1\) in order to replicate the result. Calculate the sample covariance matrix of the generated data and compare it with the true covariance matrix \(\Sigma\).

set.seed(1)
n <- 2000
mu <- c(1, 2)
sigma <- matrix(c(1, 0.5, 0.5, 1), nrow = 2, ncol = 2)
X <- MASS::mvrnorm(n, mu, sigma)
head(X, 5)
## [,1] [,2]
## [1,] 0.9005499 1.014400
## [2,] 2.1201672 1.197912
## [3,] -0.5335260 2.086175
## [4,] 2.1219187 3.641189
## [5,] 1.3132871 2.257437
cov(X)
## [,1] [,2]
## [1,] 1.0443799 0.5392157
## [2,] 0.5392157 1.1045078
The code above uses the `mvrnorm` function from the `MASS` package. However, there are alternative ways to complete this question. For example, you could first generate \(n\) standard normal random variables, and then transform them to the desired distribution. Write down the mathematical formula of this approach in LaTeX, and then write the corresponding R code to implement this approach. Only display the first 5 observations in your R output. Validate your approach by computing the sample covariance matrix of the generated data and comparing it with the true covariance matrix \(\Sigma\). Please note that you should not use the `mvrnorm` function anymore in this question.

set.seed(1)
Z <- matrix(rnorm(n*2), n, 2)
X <- sweep(Z %*% chol(sigma), 2, mu, "+")
head(X, 5)
## [,1] [,2]
## [1,] 0.3735462 0.9193450
## [2,] 1.1836433 0.4271001
## [3,] 0.1643714 2.9848877
## [4,] 2.5952808 3.2473413
## [5,] 1.3295078 2.1163864
cov(X)
## [,1] [,2]
## [1,] 1.0757730 0.5679505
## [2,] 0.5679505 1.1018494
The logic of this approach can be explained by the following:
\[\begin{align*}
\textbf{Given}: & \quad \mathbf{Z} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}) \\
\textbf{Cholesky Decomposition}: & \quad \mathbf{L} = \text{chol}(\mathbf{\Sigma}), \,\, \text{i.e.,} \,\, \mathbf{L}\mathbf{L}^T = \mathbf{\Sigma} \\
\textbf{Transformation}: & \quad \mathbf{X} = \mathbf{L}\mathbf{Z} + \mathbf{\mu} \\
\textbf{Mean of } \mathbf{X}: & \quad \mathbb{E}[\mathbf{X}] = \mathbb{E}[\mathbf{L}\mathbf{Z}] + \mathbf{\mu} = \mathbf{L}\mathbf{0} + \mathbf{\mu} = \mathbf{\mu} \\
\textbf{Covariance of } \mathbf{X}: & \quad \text{Cov}(\mathbf{X}) = \mathbf{L}\,\text{Cov}(\mathbf{Z})\,\mathbf{L}^T = \mathbf{L}\mathbf{I}\mathbf{L}^T = \mathbf{\Sigma}
\end{align*}\]
Since linear combinations of independent normal random variables are still jointly normal, we have \(\mathbf{X} \sim \mathcal{N}(\mathbf{\mu}, \mathbf{\Sigma})\).
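One detail worth noting: R's `chol()` returns the upper-triangular factor \(\mathbf{R}\) with \(\mathbf{R}^T\mathbf{R} = \mathbf{\Sigma}\), so the row-vector form `Z %*% chol(sigma)` used in the code is the transpose of the column-vector form \(\mathbf{L}\mathbf{Z}\) above, with \(\mathbf{L} = \mathbf{R}^T\). A small base-R sketch to confirm the two forms agree:

```r
# Verify that chol() reproduces Sigma, and that the row form used in the
# solution code matches the column-vector derivation above.
sigma <- matrix(c(1, 0.5, 0.5, 1), 2, 2)
mu <- c(1, 2)

R <- chol(sigma)                        # upper triangular: t(R) %*% R == sigma
stopifnot(all.equal(t(R) %*% R, sigma))

set.seed(1)
z <- rnorm(2)                           # one standard normal draw
x_col <- as.numeric(t(R) %*% z + mu)    # column form: L z + mu, with L = t(R)
x_row <- as.numeric(z %*% R + mu)       # row form, as in Z %*% chol(sigma)
stopifnot(all.equal(x_col, x_row))
```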
Write an R function called `mymvnorm` that takes the following arguments: `n`, `mu`, and `sigma`. The function should return a matrix of dimension \(n \times p\), where \(p\) is the length of `mu`. The function should generate \(n\) observations from a multivariate normal distribution with mean `mu` and covariance matrix `sigma`. You should not use the `mvrnorm` function or any other similar built-in R functions in your code. Instead, use the logic you wrote in part b) to generate the data. Again, validate your result by calculating the sample covariance matrix of the generated data and comparing it to \(\Sigma\). Also, when setting the seed correctly, your answer in this question should be identical to the one in part b).

mymvnorm <- function(n, m, S) {
p <- length(m)
Z <- matrix(rnorm(n*p), n, p)
X <- sweep(Z %*% chol(S), 2, m, "+")
return(X)
}
set.seed(1)
X <- mymvnorm(n, m = mu, S = sigma)
head(X, 5)
## [,1] [,2]
## [1,] 0.3735462 0.9193450
## [2,] 1.1836433 0.4271001
## [3,] 0.1643714 2.9848877
## [4,] 2.5952808 3.2473413
## [5,] 1.3295078 2.1163864
cov(X)
## [,1] [,2]
## [1,] 1.0757730 0.5679505
## [2,] 0.5679505 1.1018494
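As an optional sanity check (not required by the question), the same construction works for \(p > 2\). The chunk below repeats the function definition so it runs on its own, with a made-up \(3 \times 3\) covariance matrix:

```r
# Standalone sanity check of mymvnorm in three dimensions
# (the function is repeated here so this chunk is self-contained).
mymvnorm <- function(n, m, S) {
  p <- length(m)
  Z <- matrix(rnorm(n * p), n, p)
  sweep(Z %*% chol(S), 2, m, "+")
}

set.seed(1)
S3 <- diag(3) * 0.5 + 0.5   # 1 on the diagonal, 0.5 off-diagonal
X3 <- mymvnorm(5000, m = c(0, 1, 2), S = S3)
round(colMeans(X3), 2)      # should be close to (0, 1, 2)
round(cov(X3), 2)           # should be close to S3
```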
[5 points] Briefly comment on your usage of AI tools in the above questions.
[10 points] Try to create a question related to multivariate normal distribution that you think the AI is going to have difficulty answering. Write down the question and the answer you expect and got from AI. Were you able to trick the AI? If not, briefly discuss your experience.
Answer:
I was using Copilot to help me write the code. So I directly wrote a sentence as the following.
“The question I used is if \(X\) and \(Y\) are random variables, the mean of \(X\) is 1.9 and the mean of \(Y\) is 1.11, which variable has a larger mean.”
I then typed "The answer I got" on the next line, and Copilot immediately suggested the following answer.

"The answer I got from the AI is that it is not possible to determine which variable has a larger mean without additional information about the distributions of \(X\) and \(Y\)."
Also, my experience with Copilot in RStudio is that if you have wrong code in the early part of your R script, it is more likely to suggest similar wrong answers in the later part. So I would suggest always checking the code carefully.
The following question practices data manipulation, visualization, and linear regression. Load the `quantmod` package and obtain the `AAPL` data (Apple stock price).
library(quantmod)
## Loading required package: xts
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
## Loading required package: TTR
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
getSymbols("AAPL")
## [1] "AAPL"
plot(AAPL$AAPL.Close, pch = 19)
Calculate the 10-day moving average of `AAPL` and plot it on the same graph. Moving average means that for each day, you take the average of the past 10 days (including the current day). Please do this in two ways: 1) use the built-in function `SMA` in the `quantmod` package; 2) write your own function to calculate the moving average. Plot and also check if the two calculations are identical. For both questions, you can utilize AI tools to help you write the code.

AAPL$MA10 <- SMA(Cl(AAPL), 10)
plot(AAPL$AAPL.Close, pch = 19)
lines(AAPL$MA10, col = "red", lwd = 2)
moving_average <- function(x, window) {
# Compute the moving average of x with window size = window
n <- length(x)
ma <- rep(NA, n)
for (i in window:n) {
ma[i] <- mean(x[(i-window+1):i])
}
return(ma)
}
AAPL$MA10_2 <- moving_average(Cl(AAPL), 10)
plot(AAPL$AAPL.Close, pch = 19)
lines(AAPL$MA10_2, col = "red", lwd = 2)
#check if the two are the same
all.equal(AAPL$MA10, AAPL$MA10_2, check.attributes = FALSE)
## [1] TRUE
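As a quick standalone check of the loop logic on a tiny vector (the helper is repeated so this chunk runs on its own):

```r
# The first window - 1 entries stay NA; entry i averages x[i-window+1], ..., x[i].
moving_average <- function(x, window) {
  n <- length(x)
  ma <- rep(NA, n)
  for (i in window:n) ma[i] <- mean(x[(i - window + 1):i])
  ma
}
moving_average(1:5, 3)   # NA NA 2 3 4
```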
Fit a linear regression to predict the average closing price of `AAPL` over the next five days (not including the current day) using two variables: the average of the past 10 days, and the average of the past 20 days. Provide summaries of the regression results and comment on whether the information beyond the past 10 days is useful.

Answer:

The approach below was suggested by Copilot but is wrong, because it includes the current day in the 5-day average. To do this, we first need to calculate the 20-day moving average and the lead 5-day average.
AAPL$MA20 <- SMA(Cl(AAPL), 20)
AAPL$Lead5 <- zoo::rollapply(Cl(AAPL), width = 5, FUN = mean, align = "left", fill = NA)
# Remove rows with NA values
AAPL_clean <- na.omit(AAPL[, c("Lead5", "MA10", "MA20")])
# Fit the linear model
model <- lm(Lead5 ~ MA10 + MA20, data = AAPL_clean)
summary(model)
##
## Call:
## lm(formula = Lead5 ~ MA10 + MA20, data = AAPL_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -31.684 -0.600 -0.052 0.584 18.830
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.14380 0.06329 2.272 0.0231 *
## MA10 1.39765 0.02274 61.463 <2e-16 ***
## MA20 -0.39645 0.02281 -17.382 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.17 on 4672 degrees of freedom
## Multiple R-squared: 0.9979, Adjusted R-squared: 0.9979
## F-statistic: 1.126e+06 on 2 and 4672 DF, p-value: < 2.2e-16
Here is my own code:
moving_average_forward <- function(x, window) {
# Compute the moving average of x with window size = window
n <- length(x)
ma <- rep(NA, n)
for (i in 1:(n-window)) {
ma[i] <- mean(x[(i+1):(i+window)])
}
return(ma)
}
AAPL$MA20_2 <- moving_average(Cl(AAPL), 20)
AAPL$Lead5_2 <- moving_average_forward(Cl(AAPL), 5)
# Fit the linear model
model <- lm(Lead5_2 ~ MA10_2 + MA20_2, data = AAPL, na.action = na.omit)
summary(model)
##
## Call:
## lm(formula = Lead5_2 ~ MA10_2 + MA20_2, data = AAPL, na.action = na.omit)
##
## Residuals:
## Min 1Q Median 3Q Max
## -34.330 -0.658 -0.064 0.651 20.523
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.17541 0.07087 2.475 0.0134 *
## MA10_2 1.38555 0.02547 54.396 <2e-16 ***
## MA20_2 -0.38404 0.02555 -15.032 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.549 on 4671 degrees of freedom
## (24 observations deleted due to missingness)
## Multiple R-squared: 0.9974, Adjusted R-squared: 0.9974
## F-statistic: 8.978e+05 on 2 and 4671 DF, p-value: < 2.2e-16
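A similar standalone check for the forward-looking helper (repeated here so the chunk runs alone): entry \(i\) averages the next `window` days, excluding day \(i\) itself, and the last `window` entries stay `NA`.

```r
# Forward-looking average on a tiny vector: ma[i] = mean(x[i+1], ..., x[i+window]).
moving_average_forward <- function(x, window) {
  n <- length(x)
  ma <- rep(NA, n)
  for (i in 1:(n - window)) ma[i] <- mean(x[(i + 1):(i + window)])
  ma
}
moving_average_forward(1:7, 5)   # 4 5 NA NA NA NA NA
```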
To determine whether information beyond 10 days is useful, compare to
a model that does not use MA20_2
:
# Fit the linear model
model2 <- lm(Lead5_2 ~ MA10_2, data = AAPL[!(is.na(AAPL$MA20_2)),], na.action = na.omit)
AIC(model,model2)
Information beyond 10 days seems useful based on the AIC.
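Since the AIC output is not printed above, here is a self-contained sketch of the same kind of nested-model comparison on simulated data; `x1`, `x2`, and `y` are made-up stand-ins for `MA10_2`, `MA20_2`, and `Lead5_2`, and an `anova()` F-test is an alternative to AIC for nested models:

```r
# Simulate two highly correlated predictors, mimicking the two moving averages,
# where the second predictor carries some genuine extra information.
set.seed(1)
n  <- 500
x1 <- rnorm(n)
x2 <- 0.9 * x1 + sqrt(0.19) * rnorm(n)   # correlation with x1 about 0.9
y  <- 1 + 1.4 * x1 - 0.4 * x2 + rnorm(n)

full    <- lm(y ~ x1 + x2)
reduced <- lm(y ~ x1)

AIC(full, reduced)    # lower AIC indicates the preferred model
anova(reduced, full)  # nested F-test for the extra predictor
```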
Examples of potential issues: `MA10_2` and `MA20_2` are highly correlated, which makes it difficult to determine the individual effect of each predictor on the dependent variable. The sign and the magnitude of the coefficient estimates may be misleading. (You can check this by fitting a model with only `MA20_2` as the explanatory variable; the sign is flipped.)

The `ElemStatLearn` package [CRAN link] is an archived package. Hence, you cannot directly install it using the `install.packages()` function. Instead, you may install an older version of it by using the `install_github()` function from the `devtools` package. Install the `devtools` package and find the code to install the `ElemStatLearn` package.

library(devtools)
devtools::install_github("cran/ElemStatLearn")
# alternatively, you can do
devtools::install_version("ElemStatLearn", version = "2015.6.26.2",
                          repos = "http://cran.us.r-project.org")
Load the `ElemStatLearn` package and obtain the `ozone` data. Save this data into a `.csv` file, and then read the data back from that file into R. Print out the first 5 observations to make sure that the new data is the same as the original one.

We first get the `ozone` data and save it in a `.csv` file.
library(ElemStatLearn)
data(ozone)
write.csv(ozone, file = "ozone.csv")
We then load the data back into R:
ozone_new = read.csv("ozone.csv")
However, these two datasets are different because the `read.csv()` function will treat the first column (observation IDs) as a new variable.
head(ozone_new)
To prevent this issue, we specify the argument `row.names = 1`. This leads to the correct data.
ozone_new = read.csv("ozone.csv", row.names = 1)
head(ozone_new)
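The same round trip can be checked on any built-in data frame; here is a standalone sketch using `mtcars` and a temporary file (the file name is illustrative, not part of the homework):

```r
# Write a data frame with row names, read it back with row.names = 1,
# and confirm the round trip preserves the data.
f <- tempfile(fileext = ".csv")
write.csv(mtcars, file = f)
back <- read.csv(f, row.names = 1)
all.equal(back, mtcars)   # TRUE
unlink(f)                 # clean up the temporary file
```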
For this question, my AI tool (Copilot) did not suggest the correct code, since it did not realize that R will also read the row names as a new variable. Hence, after doing some research, I found that adding `row.names = 1` solves the problem. This was the code originally suggested by Copilot.
library(ElemStatLearn)
data(ozone)
write.csv(ozone, "ozone.csv")
ozone2 <- read.csv("ozone.csv")
head(ozone2, 5)