In the first question, we will use a simulation study to confirm the theoretical analysis developed in the lecture. In the second question, we will practice several linear model selection techniques, such as AIC, BIC, and best subset selection. The main difficulty, however, lies in the data processing step, in which we use the Bitcoin data from Kaggle. This is essentially time-series data, and we use information from previous days to predict the price on a future day. Make sure that you process the data correctly for this task.
Let’s use a simulation study to confirm the bias-variance trade-off of linear regression. Consider the following model:
\[Y = \sum_{j=1}^{p} 0.8^j X_j + \epsilon\]
All covariates and the error term follow i.i.d. standard Gaussian distributions. The true model involves all the variables; however, variables with larger indices contribute very little to the variation of \(Y\). Hence, there could be a benefit to using a smaller subset of variables for prediction purposes. Let’s confirm that with a simulation study.
set.seed(542)
n = 100
p = 30
b = 0.8^(1:p)                 # true coefficients decay geometrically
X = matrix(rnorm(n*p), n, p)  # fixed design with i.i.d. standard normal entries
Ytrue = X %*% b               # noiseless mean of Y at the design points
Without running the simulation, for each \(j\) value, we also have the theoretical decomposition of the testing error based on the lecture. Suppose you know the true model, the covariates \(X\), and the distribution of the random noise. Plot the number of variables on the x-axis and the bias^2, variance, and theoretical testing error on the y-axis. Label each line. After finishing the simulation, add the empirical prediction error (pred err) to the same figure as in question (a). Label your line. Does your empirical testing error match our theoretical analysis? Comment on your findings.
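For reference, the code below computes exactly the terms of this fixed-design decomposition; in our notation, with \(X_{(j)}\) denoting the first \(j\) columns of \(X\) and \(H_j\) its hat matrix,
\[\text{Err}(j) = \underbrace{\tfrac{1}{n}\|(I - H_j)X\beta\|^2}_{\text{bias}^2} + \underbrace{\tfrac{j\,\sigma^2}{n}}_{\text{variance}} + \underbrace{\sigma^2}_{\text{irreducible error}}, \qquad H_j = X_{(j)}\big(X_{(j)}^\top X_{(j)}\big)^{-1}X_{(j)}^\top.\]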
# Theoretical bias and variance
sigma = 1
varModel = sigma^2 * (1:p) / n   # variance term: j * sigma^2 / n
irreduceErr = sigma^2            # irreducible error
bias2Model = rep(0, p)
for (j in 1:p){
  xsub = as.matrix(X[, 1:j])
  # hat matrix of the submodel using the first j covariates
  H = xsub %*% solve(t(xsub) %*% xsub) %*% t(xsub)
  bias2Model[j] = mean((Ytrue - H %*% Ytrue)^2)
}
# Simulation
set.seed(1)
N.sim = 100
errMat = matrix(0, N.sim, p)
predPerIter = matrix(0, N.sim, n)
for (k in 1:N.sim){
  # draw fresh training and testing noise on the same fixed design X
  Ytrain = Ytrue + rnorm(n)
  Ytest = Ytrue + rnorm(n)
  train = data.frame(Ytrain, X)
  test = data.frame(Ytest, X)
  for (j in 1:p){
    # use only the first j covariates for training (no intercept)
    fit = lm(Ytrain ~ . - 1, data = train[, 1:(1+j)])
    pred = predict(fit, test)
    # testing error
    errMat[k, j] = mean((pred - Ytest)^2)
    # record the prediction vector for the model with j = 5
    if (j == 5){
      predPerIter[k, ] = pred
    }
  }
}
# calculate the empirical prediction error
predErr = colMeans(errMat)
# calculate the empirical bias^2: average the predictions over simulations,
# then measure the deviation from the true mean
predAvg = colMeans(predPerIter)
bias2Empirical = mean((predAvg - Ytrue)^2)
plot(1:p, predErr, type = "l", ylim = c(0, 2.5), lty = 2, lwd = 2,
     xlab = "number of variables", ylab = "error")
lines(1:p, bias2Model, col = "red", lwd = 2)
lines(1:p, varModel, col = "blue", lwd = 2)
lines(1:p, bias2Model + varModel + irreduceErr, col = "green", lty = 2, lwd = 3)
legend(x = 15, y = 2.5, legend = c("pred err", "bias^2", "variance", "theoretical test err"),
       col = c("black", "red", "blue", "green"), lty = c(2, 1, 1, 2))
Answer
See the plot.
(bias2Model + varModel + irreduceErr)[30]
## [1] 1.3
The theoretical testing error at \(p = 30\) is 1.3.
sprintf("Given p=5, the emprical prediction error is %.3f", predErr[5])
## [1] "Given p=5, the emprical prediction error is 1.214"
The empirical prediction/testing error is 1.214; see the plot. The black line (pred err) matches the green line (theoretical test err); the small gap comes from simulation error. This is expected.
sprintf("Fixing p = 5, theoritical bias^2 is %.3f; empirical bias^2 is %.3f",
bias2Model[5], bias2Emprical)
## [1] "Fixing p = 5, theoritical bias^2 is 0.189; empirical bias^2 is 0.190"
Our theoretical bias^2 is 0.189 and empirical bias is 0.190. They are matched.
Remark: we use predPerIter[k, ] = pred
to record the prediction in each iteration. To calculate the bias, we first average the predictions over the 100 simulations and then evaluate the deviation of this average from the true model.
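A similar check can be done for the variance: for each observation, take the variance of its predictions across the N.sim replications, then average over the n observations. The theoretical counterpart at \(p = 5\) is \(5\sigma^2/n = 0.05\). The snippet below is a small sketch using the predPerIter matrix recorded above.
# empirical variance at p = 5: per-observation variance across replications,
# averaged over the n observations
var5Empirical = mean(apply(predPerIter, 2, var))
sprintf("Fixing p = 5, theoretical variance is %.3f; empirical variance is %.3f",
        varModel[5], var5Empirical)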
For this question, we will use the Bitcoin data provided on the course website. The data were originally posted on Kaggle (link). Make sure that you read the relevant information on the Kaggle website. Our data is the bitcoin_dataset.csv file. You should use a training/testing split such that your training data are constructed using only information up to 12/31/2016, and your testing data are constructed using only information starting from 01/01/2017. The goal of our analysis is to predict btc_market_price. Since this is longitudinal data, we will use the information from previous days to predict the market price on a future day. In particular, starting on each calendar day (say, day 1), we use the information from three consecutive days (days 1, 2, and 3) to predict the market price on the 7th day.
Hence, you first need to reconstruct the data to fit this purpose. This mainly means putting the outcome (of day 7) and the covariates (of the previous days) into the same row. Note that for this question you may face issues such as missing data, categorical predictors, outliers, scaling issues, computational issues, and possibly others. Use your best judgment to deal with them. There is no general ``best answer''. Hence, the grading will be based on whether you provide reasoning for your decisions and whether you carry out the analysis correctly.
For each of the above tasks, make sure that you clearly document your choices. In the end, provide a summary table/figure of your data. You can consider using boxplots, quantiles, histograms, or any method that is easy for readers to understand. You are required to pick at least one method to present.
bitcoin = read.csv(file = "../data/bitcoin.csv")
head(bitcoin)
library(tidyverse)
# drop the "btc_" prefix from the column names
colnames(bitcoin) = str_remove(colnames(bitcoin), "btc_")
# log-transform the heavily right-skewed variables
bitcoin$difficulty = log(1 + bitcoin$difficulty)
bitcoin$market_cap = log(1 + bitcoin$market_cap)
bitcoin$estimated_transaction_volume_usd = log(1 + bitcoin$estimated_transaction_volume_usd)
bitcoin$trade_volume = log(1 + bitcoin$trade_volume)
bitcoin$market_price = log(1 + bitcoin$market_price)
library(zoo)
# fill missing trade_volume values by carrying the last observation forward
bitcoin$trade_volume = na.locf(bitcoin$trade_volume)
# standardize all variables (the Date column is dropped)
bitcoin_data = data.frame(scale(bitcoin[, -1]))
# boxplot summary of the standardized covariates
op <- par(mar = c(5, 10, 4, 2))
boxplot(bitcoin_data[, -1], horizontal = TRUE, las = 1)
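If a table is preferred over the boxplot, a quantile summary is one alternative that also satisfies the reporting requirement; the snippet below is just a sketch of that option.
# quartiles of each standardized variable
round(apply(bitcoin_data, 2, quantile, probs = c(0.25, 0.5, 0.75), na.rm = TRUE), 2)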
# align the covariates of three consecutive days into the same row:
# row i of pre1/pre2/pre3 holds the covariates of days i, i-1, and i-2
bitcoin_pre3 = rbind(NA, NA, bitcoin_data)
colnames(bitcoin_pre3) = paste("pre3_", colnames(bitcoin_pre3), sep = "")
bitcoin_pre2 = rbind(NA, bitcoin_data, NA)
colnames(bitcoin_pre2) = paste("pre2_", colnames(bitcoin_pre2), sep = "")
bitcoin_pre1 = rbind(bitcoin_data, NA, NA)
colnames(bitcoin_pre1) = paste("pre1_", colnames(bitcoin_pre1), sep = "")
# outcome: the (log) market price of the future day, shifted into the same row
market_price = c(bitcoin$market_price[7:nrow(bitcoin)], rep(NA, 8))
bitcoin_meta = cbind(market_price, bitcoin_pre1, bitcoin_pre2, bitcoin_pre3)
# split by calendar date: training up to 12/31/2016, testing from 01/01/2017
train = bitcoin_meta[3:(2504-6), ]
test = bitcoin_meta[2504:(2922-8), ]
Remark:
Any reasonable transformation is acceptable. One may also construct new variables, such as averaging some variables over the three days.
For variables or responses with exponential growth, one possible transformation is taking the logarithm; modeling \(\log y\) is similar to modeling the growth rate.
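As one concrete example of such a constructed variable, the three lagged market-price columns built above can be averaged into a single feature. This is only a sketch of the idea; it is not used in the models fitted below.
# hypothetical engineered feature: 3-day average of the standardized market price
bitcoin_meta$avg_market_price = rowMeans(
  bitcoin_meta[, c("pre1_market_price", "pre2_market_price", "pre3_market_price")])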
[20 Points] Model Selection Criterion. Use the AIC and BIC criteria to select the best model and report the result from each of them. Use forward selection for AIC and backward selection for BIC. Report the mean squared errors on both the training and testing data.
# full linear model as a baseline
lmfit = lm(market_price ~ ., data = train)
mseTrain = mean(lmfit$residuals^2)
mseTest = mean((test$market_price - predict(lmfit, newdata = test))^2)
# AIC with forward selection (penalty k = 2)
AICfit = step(lm(market_price ~ 1, data = train), scope = list(upper = lmfit, lower = ~1),
              direction = "forward", k = 2, trace = 0)
mseAICtrain = mean(AICfit$residuals^2)
AIC_pred = predict(AICfit, test)
mseAICtest = mean((test$market_price - AIC_pred)^2)
# BIC with backward selection (penalty k = log(n))
BICfit = step(lmfit, scope = list(upper = lmfit, lower = ~1),
              direction = "backward", k = log(nrow(train)), trace = 0)
mseBICtrain = mean(BICfit$residuals^2)
BIC_pred = predict(BICfit, test)
mseBICtest = mean((test$market_price - BIC_pred)^2)
sprintf("AIC: train MSE %.3f, test MSE %.3f; BIC: train MSE %.3f, test MSE %.3f",
        mseAICtrain, mseAICtest, mseBICtrain, mseBICtest)
## [1] "AIC: train MSE 0.014, test MSE 0.038; BIC: train MSE 0.014, test MSE 0.301"
We use the leaps package for best subset selection.
library(leaps)
# best subset selection with at most 7 variables
RSSleaps = regsubsets(as.matrix(train[, c(2:46)]), train[, 1], nvmax = 7)
## Reordering variables and trying again:
BSS = t(summary(RSSleaps, matrix = T)$which)
# show only the variables that appear in at least one best model
BSS[rowSums(BSS) > 0, ]
## 1 2 3 4 5 6 7 8
## (Intercept) TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## pre1_market_price TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## pre1_trade_volume FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE
## pre1_n_transactions_per_block FALSE FALSE FALSE FALSE FALSE TRUE FALSE TRUE
## pre1_n_transactions_total FALSE FALSE FALSE FALSE TRUE TRUE TRUE FALSE
## pre2_market_price FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE
## pre2_blocks_size FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE
## pre2_hash_rate FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE
## pre2_difficulty FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
## pre2_transaction_fees FALSE TRUE TRUE TRUE FALSE FALSE TRUE FALSE
## pre2_n_transactions_total FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
As one example, we use the transformed covariates from the first two days to perform best subset selection. The best models of each size are listed above.
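To evaluate one of these subsets on the testing data, the chosen variables can be refit with lm(); the sketch below does this for the best 5-variable model, using the BSS table computed above.
# refit the best 5-variable subset and compute its test MSE
vars5 = names(which(BSS[-1, 5]))   # drop the intercept row, keep selected names
fit5 = lm(reformulate(vars5, response = "market_price"), data = train)
mean((test$market_price - predict(fit5, test))^2)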