Please remove this section when submitting your homework.
Students are encouraged to work together on homework and/or utilize advanced AI tools. However, sharing, copying, or providing any part of a homework solution or code to others is an infraction of the University’s rules on Academic Integrity. Any violation will be punished as severely as possible. Final submissions must be uploaded to Gradescope. No email or hard copy will be accepted. For late submission policy and grading rubrics, please refer to the course website.
Submit your report as HWx_yourNetID.pdf. For example, HW01_rqzhu.pdf. Please note that this must be a .pdf file; .html format cannot be accepted. Make all of your R code chunks visible for grading.
If you use the .Rmd file as a template, be sure to remove this instruction section.
Make sure that your R version is \(\geq 4.0.0\). This will ensure your random seed generation is the same as everyone else's. Please note that updating the R version may require you to reinstall all of your packages.
We will be using the first 2500 observations of the MNIST dataset. You can use the following code, or the saved data from our previous homework.
# inputs to download file
fileLocation <- "https://pjreddie.com/media/files/mnist_train.csv"
numRowsToDownload <- 2500
localFileName <- paste0("mnist_first", numRowsToDownload, ".RData")
# download the data and add column names
mnist <- read.csv(fileLocation, nrows = numRowsToDownload)
numColsMnist <- ncol(mnist)
colnames(mnist) <- c("Digit", paste0("Pixel", seq_len(numColsMnist - 1)))
# save file
# in the future we can read in from the local copy instead of having to redownload
save(mnist, file = localFileName)
# you can load the data with the following code
load(file = localFileName)
[10 pts] Write your own code to fit a Linear Discriminant Analysis (LDA) model to the MNIST dataset. Use the first 1250 observations as the training set and the remaining observations as the test set. An issue with this dataset is that some pixels display little or no variation across all observations. This zero-variance issue poses a problem when inverting the estimated covariance matrix. To address this issue, take digits 1, 7, and 9 from the training data, and screen all 784 pixels by their marginal variance. Keep the 300 pixels with the largest variance and use them to fit the LDA model. Remove the remaining pixels from both the training and test data.
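The screening step described above can be sketched on a small synthetic data set. This is only an illustration under stated assumptions: the toy data frame, the 100/100 split, and top_k = 5 are hypothetical stand-ins for the MNIST data, the 1250/1250 split, and the 300 pixels; filtering the test rows to the same digits is also an assumption.

```r
# Toy stand-in for the MNIST data frame (hypothetical data, not MNIST)
set.seed(1)
n <- 200   # observations
p <- 20    # pixel columns
pixels <- matrix(rnorm(n * p, sd = rep(seq_len(p), each = n)), nrow = n, ncol = p)
pixels[, 1:3] <- 0   # constant columns mimic pixels that never vary
colnames(pixels) <- paste0("Pixel", seq_len(p))
toy <- data.frame(Digit = sample(c(1, 3, 7, 9), n, replace = TRUE), pixels)

# Split first, then keep only the digits of interest
train <- toy[1:100, ]
test  <- toy[101:200, ]
keep_digits <- c(1, 7, 9)
train <- train[train$Digit %in% keep_digits, ]
test  <- test[test$Digit %in% keep_digits, ]   # assumption: same filter on test

# Marginal variance of every pixel, computed on the training rows only
pixel_cols <- setdiff(names(train), "Digit")
pixel_var <- apply(train[, pixel_cols], 2, var)

# Keep the top_k pixels with the largest variance in both sets
top_k <- 5
selected <- names(sort(pixel_var, decreasing = TRUE))[seq_len(top_k)]
train <- train[, c("Digit", selected)]
test  <- test[, c("Digit", selected)]
```

Computing the variances on the training rows only, and then applying the same column selection to the test rows, keeps the test data out of the screening step.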
[30 pts] Write your own code to implement the LDA model. Remember that LDA requires the estimation of several parameters: \(\Sigma\), \(\mu_k\), and \(\pi_k\). Estimate these parameters and calculate the decision scores \(\delta_k\) on the testing data to predict the class label. Report the accuracy and the confusion matrix based on the testing data.
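For reference, one standard way to write the quantities named above is given below. The symbols \(n\) (training sample size), \(n_k\) (number of training observations in class \(k\)), and \(K\) (number of classes) are notation assumed here, and the \(n - K\) denominator is one common convention for the pooled covariance rather than the only possible choice.

\[
\hat{\pi}_k = \frac{n_k}{n}, \qquad
\hat{\mu}_k = \frac{1}{n_k} \sum_{i:\, y_i = k} x_i, \qquad
\hat{\Sigma} = \frac{1}{n - K} \sum_{k=1}^{K} \sum_{i:\, y_i = k} (x_i - \hat{\mu}_k)(x_i - \hat{\mu}_k)^{\mathsf{T}}
\]

\[
\delta_k(x) = x^{\mathsf{T}} \hat{\Sigma}^{-1} \hat{\mu}_k - \frac{1}{2} \hat{\mu}_k^{\mathsf{T}} \hat{\Sigma}^{-1} \hat{\mu}_k + \log \hat{\pi}_k, \qquad
\hat{y}(x) = \arg\max_k \, \delta_k(x)
\]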
[10 pts] Use the lda() function from the MASS package to fit LDA. Report the accuracy and the confusion matrix based on the testing data. Compare your results with part b.
[10 pts] Use the qda() function from the MASS package to fit QDA. Does the code work directly? Why? If you were asked to modify your own code to perform QDA, what would you do? Discuss this issue and propose at least two solutions to address it. If relevant, provide mathematical reasoning (in LaTeX) for your solutions. You do not need to implement them in code.
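As a starting point for the discussion, the usual QDA decision score replaces the shared \(\Sigma\) with a class-specific covariance \(\Sigma_k\) (notation assumed here, by analogy with \(\mu_k\) and \(\pi_k\)):

\[
\delta_k(x) = -\frac{1}{2} \log \lvert \Sigma_k \rvert - \frac{1}{2} (x - \mu_k)^{\mathsf{T}} \Sigma_k^{-1} (x - \mu_k) + \log \pi_k
\]

Each \(\Sigma_k\) must be invertible for this score to be defined. One commonly discussed remedy, shown only as a sketch, shrinks each class covariance toward the pooled one with a tuning parameter \(\alpha \in [0, 1]\) (a hypothetical symbol introduced here):

\[
\hat{\Sigma}_k(\alpha) = \alpha \, \hat{\Sigma}_k + (1 - \alpha) \, \hat{\Sigma}
\]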
Load the Carseats data from the ISLR package.
Use the following code to define the training and test sets.
# load library
library(ISLR)
# load data
data(Carseats)
# set seed
set.seed(7)
# number of rows in entire dataset
n_Carseats <- dim(Carseats)[1]
# training set parameters
train_percentage <- 0.75
train_size <- floor(train_percentage*n_Carseats)
train_indices <- sample(x = 1:n_Carseats, size = train_size)
# separate dataset into train and test
train_Carseats <- Carseats[train_indices,]
test_Carseats <- Carseats[-train_indices,]
[20 pts] We seek to predict the variable Sales using a regression tree. Load the library rpart. Fit a regression tree to the training set using the rpart() function, leaving all hyperparameter arguments at their defaults. Load the library rpart.plot. Plot the tree using the prp() function. Based on this model, what types of observations have the highest or lowest sales? Use the tree to predict on the test set, then calculate and report the MSE on the testing data.
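The fit, predict, and MSE steps above follow a common pattern, sketched here on a stand-in data set. The use of mtcars (with mpg playing the role of Sales) is an assumption made so the example runs without the ISLR package; it is not the assignment's data, and the numbers it prints say nothing about Carseats.

```r
library(rpart)

# Stand-in data (assumption): mtcars ships with base R; mpg plays the role of Sales
set.seed(7)
n <- nrow(mtcars)
train_idx <- sample(seq_len(n), size = floor(0.75 * n))
train <- mtcars[train_idx, ]
test  <- mtcars[-train_idx, ]

# Regression tree with all hyperparameters left at their defaults
fit <- rpart(mpg ~ ., data = train)

# Plotting needs the rpart.plot package, so it is left commented out here:
# rpart.plot::prp(fit)

# Predict on the held-out rows and compute the test MSE
pred <- predict(fit, newdata = test)
test_mse <- mean((test$mpg - pred)^2)
print(test_mse)
```

The terminal nodes with the largest and smallest fitted values, read along their split paths in the plot, describe which kinds of observations the tree predicts to be highest or lowest.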
[20 pts] Set the seed to 7 at the beginning of the chunk and do this question in a single chunk so the seed doesn’t get switched. Find the largest complexity parameter value of the tree you grew in part a) that will ensure that the cross-validation error < min(cross-validation error) + cross-validation standard deviation. Print that complexity parameter value. Prune the tree using that value. Predict using the pruned tree onto the test set, calculate the test Mean-Squared Error, and print it.
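The selection rule above can be read off the cptable component of an rpart fit, whose columns include CP, xerror, and xstd. The sketch below uses iris as a stand-in data set (an assumption, chosen because it ships with base R) and takes the "cross-validation standard deviation" to be the xstd value at the row with the smallest xerror, which is one reading of the rule rather than the only one.

```r
library(rpart)

set.seed(7)  # the cross-validation folds inside rpart() are random, so seed first

# Stand-in data (assumption): iris, with Sepal.Length as the response
n <- nrow(iris)
train_idx <- sample(seq_len(n), size = floor(0.75 * n))
train <- iris[train_idx, ]
test  <- iris[-train_idx, ]

fit <- rpart(Sepal.Length ~ ., data = train)

# Threshold: smallest cross-validation error plus its standard deviation
cp_tab <- fit$cptable
best <- which.min(cp_tab[, "xerror"])
threshold <- cp_tab[best, "xerror"] + cp_tab[best, "xstd"]

# Largest CP whose cross-validation error is below the threshold
eligible <- cp_tab[, "xerror"] < threshold
cp_1se <- max(cp_tab[eligible, "CP"])
print(cp_1se)

# Prune at that CP, then evaluate on the held-out rows
pruned <- prune(fit, cp = cp_1se)
pred <- predict(pruned, newdata = test)
test_mse <- mean((test$Sepal.Length - pred)^2)
print(test_mse)
```

Because larger CP values correspond to smaller trees, taking the maximum eligible CP selects the simplest tree whose cross-validation error is within the threshold.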