STAT 542: Homework 3

Instruction
About HW3
Question 1 [50 Points] A Simulation Study
Question 2 [50 Points] Bitcoin price prediction

Instruction

Students are encouraged to work together on homework. However, sharing, copying, or providing any part of a homework solution or code is an infraction of the University’s rules on Academic Integrity. Any violation will be punished as severely as possible. Final submissions must be uploaded to compass2g. No email or hardcopy will be accepted. For late submission policy and grading rubrics, please refer to the course website.

What is expected for the submission to Gradescope
- You are required to submit one rendered PDF file HWx_yourNetID.pdf. For example, HW01_rqzhu.pdf. Please note that this must be a .pdf file generated by a .Rmd file. .html format cannot be accepted.
- Please follow the instructions on Gradescope to select corresponding PDF pages for each question.
Please note that your homework file is a PDF report instead of a messy collection of R codes. This report should include:
- Your Name and NetID. (Replace Ruoqing Zhu(rqzhu) by your name and NetID if you are using this template).
- Make all of your R code chunks visible for grading.
- Relevant outputs from your R code chunks that support your answers.
- Provide clear answers or conclusions for each question. For example, you could start with Answer: I fit SVM with the following choice of tuning parameters ...
- Many assignments require your own implementation of algorithms. Basic comments are strongly encouraged to explain the logic to our graders. However, line-by-line code comments are unnecessary.
Requirements regarding the .Rmd file.
- You do NOT need to submit Rmd files. However, your PDF file should be rendered directly from it.
- Make sure that you set random seeds for simulation or randomized algorithms so that the results are reproducible. If a specific seed number is not provided in the homework, you can consider using your NetID.
- For some questions, there will be restrictions on what packages/functions you can use. Please read the requirements carefully. As long as the question does not specify such restrictions, you can use anything.

About HW3

In the first question, we will use a simulation study to confirm the theoretical analysis we developed during the lecture. In the second question, we will practice several linear model selection techniques such as AIC, BIC, and best subset selection. However, some difficulties are at the data processing part, in which we use the Bitcoin data from Kaggle. This is essentially a time-series dataset, and we use the information in previous days to predict the price in a future day. Make sure that you process the data correctly to fit this task.

Question 1 [50 Points] A Simulation Study

Let’s use a simulation study to confirm the bias-variance trade-off of linear regressions. Consider the following model.

\[Y = \sum_j^p 0.8^j \times X_j + \epsilon\] All the covariates and the error term follow i.i.d. standard Gaussian distribution. The true model involves all the variables; however, larger indexes do not significantly contribute to the variation. Hence, there could be a benefit in using a smaller subset for prediction purposes. Let’s confirm that with a simulation study.

Generate 100 samples of covariates \(X\) with \(p=30\) by the following code.

  set.seed(542)
  n = 100
  p = 30
  b = 0.8^(1:p)
  X = matrix(rnorm(n*p), n, p)
  Ytrue = X %*% b

Then the study essentially repeats the following steps 100 times. Begin with another fixed random seed before your loop.
- Using the fixed covariates \(X\), generate 100 training outcomes \(Y_\text{train}\) and 100 testing outcomes \(Y_\text{test}\) independently.
- Consider using only the first \(j\) variables to fit the linear regression (NO intercept term). Let \(j\) ranges from 1 to 30. Calculate and record the corresponding prediction error by comparing your prediction with the outcomes for testing data.

Without running the simulation, for each \(j\) value, we also have the theoretical decomposition of the testing error based on the lecture. Suppose you know the true model, covariates \(X\) and the distribution of random noise.

[15 pts] Please calculate the bias^2 , variance (of the prediction) and testing error for each \(j\) based on the theoretical formulas. Plot the 3 lines on the same figure, using the number of variables as the x-axis and bias^2, variance, theoretical testing error as the y-axis. Label each line.

\(\text{Bias}^2 = \frac{1}{n} \| E(Y_\text{pred}) - Y_\text{true} \|^2\), where \(Y\) is an \(n \times 1\) vector.
\(\text{Var} = \frac{1}{n} E \| Y_\text{pred} - E(Y_\text{pred}) \|^2\).

[5 pts] Report the theoretical testing error with \(p = 30\), \(\frac{1}{n}E \|Y_\text{test} - Y_\text{pred} \|^2\).

After finishing the simulation:

[20 pts] Perform the simulation. Report the averaged (empirical) prediction error with \(p = 30\). Note that 100 times simulation approximates the \(E\) operation. Plot pred err in the same figure of question a. Label your line. Does your empirical testing error match our theoretical analysis? Comment on your findings.
[10 pts] Evaluate the bias^2 for model \(p=5\) without theoretical formulas. You can still assume you know the true outcomes while using the average results to approximate the \(E\) operation. Compare the empirical value with the theoretical one.

Question 2 [50 Points] Bitcoin price prediction

For this question, we will use the Bitcoin data provided on the course website. The data were posted originally on Kaggle (link). Make sure that you read relevant information from the Kaggle website. Our data is the bitcoin_dataset.csv file. You should use a training/testing split such that your training data is constructed using only information up to 12/31/2016, and your testing data is constructed using only information starting from 01/01/2017. The goal of our analysis is to predict the btc_market_price. Since this is longitudinal data, we will use the information from previous days to predict the market price at a future day. In particular, on each calendar day (say, day 1), we use the information from three days onward (days 1, 2, and 3) to predict the market price on the 7th day.

Hence you need to first reconstruct the data properly to fit this purpose. This is mainly to put the outcome (of day 7) and the covariates (of the previous days) into the same row. Note that you may face missing data, categorical predictors, outliers, scaling issues, computational issues, and maybe others for this question. Use your best judgment to deal with them. There is no general ``best answer’’. Hence the grading will be based on whether you provided reasoning for your decision and whether you carried out the analysis correctly.

[25 Points] Data Construction. Data pre-processing is usually the most time-consuming and difficult part of an analysis. We will use this example as a practice. Construct your data appropriately such that further analysis can be performed. Make sure that you consider the following:
- The data is appropriate for our analysis goal: each row contains the outcome on the seventh day and the covariates based on the first three days. The covariates are not limited to the price.
- Missing data is addressed (you can remove variables, remove observations, impute values or propose your own method)
- You may process the covariates and/or outcome by considering centering, scaling, transformation, removing outliers, etc. However, these are your choice.

For each of the above tasks, make sure that you clearly document your choice. In the end, provide a summary table/figure of your data. You can consider using boxplots, quantiles, histograms, or any method that is easy for readers to understand. You are required to pick at least one method to present.

  # bitcoin = read.csv(file = "bitcoin.csv")
  # head(bitcoin)

[20 Points] Model Selection Criterion. Use AIC and BIC criteria to select the best model and report the result from each of them. Use the forward selection for AIC and backward selection for BIC. Report the following mean squared error from both training and testing data.
- The mean squared error: \(n^{-1} \sum_{i}(Y_i - \widehat{Y}_i)^2\)
- Since these quantities can be affected by scaling and transformation, make sure that you state any modifications applied to the outcome variable. Compare the training data errors and testing data errors. Which model works better? Provide a summary of your results.
[10 Points] Best Subset Selection. Fit the best subset selection to the dataset and report the best model of each model size (up to 7 variables, excluding the intercept) and their prediction errors. Make sure that you simplify your output to only present the essential information. If the algorithm cannot handle this many variables, then consider using just day 1 and 2 information. You can use leaps package for subset selection.

STAT 542: Homework 3

Spring 2022, by Ruoqing Zhu (rqzhu)

Due: Thur, Feb 10, 11:59 PM CT

Instruction

About HW3

Question 1 [50 Points] A Simulation Study

Question 2 [50 Points] Bitcoin price prediction