Instruction

Please remove this section when submitting your homework.

Students are encouraged to work together on homework and/or utilize advanced AI tools. However, sharing, copying, or providing any part of a homework solution or code to others is an infraction of the University’s rules on Academic Integrity. Any violation will be punished as severely as possible. Final submissions must be uploaded to Gradescope. No email or hard copy will be accepted. For late submission policy and grading rubrics, please refer to the course website.

Question 1: Discriminant Analysis (60 points)

We will be using the first 2500 observations of the MNIST dataset. You can use the following code, or the saved data from our previous homework.

  # inputs to download file
  fileLocation <- "https://pjreddie.com/media/files/mnist_train.csv"
  numRowsToDownload <- 2500
  localFileName <- paste0("mnist_first", numRowsToDownload, ".RData")

  # download the data and add column names
  mnist <- read.csv(fileLocation, nrows = numRowsToDownload)
  numColsMnist <- dim(mnist)[2]
  colnames(mnist) <- c("Digit", paste0("Pixel", seq_len(numColsMnist - 1)))

  # save file
  # in the future we can read in from the local copy instead of having to redownload
  save(mnist, file = localFileName)
  
  # you can load the data with the following code
  load(file = localFileName)

  1. [10 pts] Write your own code to prepare the MNIST data for fitting a Linear Discriminant Analysis (LDA) model. Use the first 1250 observations as the training set and the remaining observations as the test set. An issue with this dataset is that some pixels display little or no variation across all observations. This zero-variance issue poses a problem when inverting the estimated covariance matrix. To address this issue, take digits 1, 7, and 9 from the training data, and screen all 784 pixels by their marginal variance. Keep the 300 pixels with the largest variance and use them to fit the LDA model. Remove the remaining pixels from both the training and test data.

  2. [30 pts] Write your own code to implement the LDA model. Remember that LDA requires the estimation of several parameters: \(\Sigma\), \(\mu_k\), and \(\pi_k\). Estimate these parameters and calculate the decision scores \(\delta_k\) on the testing data to predict the class label. Report the accuracy and the confusion matrix based on the testing data.

  3. [10 pts] Use the lda() function from the MASS package to fit LDA. Report the accuracy and the confusion matrix based on the testing data. Compare your results with part 2.

  4. [10 pts] Use the qda() function from the MASS package to fit QDA. Does the code work directly? Why or why not? If you were asked to modify your own code to perform QDA, what would you do? Discuss this issue and propose at least two solutions to address it. If relevant, provide the mathematical reasoning behind your solution (in LaTeX). You do not need to implement it in code.
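
As a reminder of the mechanics, the LDA score for class \(k\) is \(\delta_k(x) = x^\top \Sigma^{-1} \mu_k - \tfrac{1}{2} \mu_k^\top \Sigma^{-1} \mu_k + \log \pi_k\), with \(\Sigma\) estimated by the pooled within-class covariance. The sketch below illustrates variance screening and these scores on a small simulated dataset, not on MNIST. Every name in it (X, y, keep, mu_hat, Sigma_hat, delta) is hypothetical, the data are made up, and accuracy is computed on the same data used for fitting purely to keep the example short. It is one possible layout under those assumptions, not a reference solution.

```r
# Toy illustration on simulated data; all object names are hypothetical.
set.seed(1)
n <- 300
p <- 6
y <- sample(c(1, 7, 9), n, replace = TRUE)
X <- matrix(rnorm(n * p), n, p)
X[, 1] <- X[, 1] + 2 * (y == 7)  # shift class means so the classes differ
X[, 2] <- X[, 2] - 2 * (y == 9)
X[, 6] <- 0                      # constant column, like a blank pixel

# Variance screening: keep the 4 columns with the largest marginal variance
keep <- order(apply(X, 2, var), decreasing = TRUE)[1:4]
Xs <- X[, keep]

# Estimate priors, class means (K x p), and the pooled covariance
classes <- sort(unique(y))
K <- length(classes)
pi_hat <- as.numeric(table(factor(y, levels = classes))) / n
mu_hat <- t(sapply(classes, function(k) colMeans(Xs[y == k, , drop = FALSE])))
Sigma_hat <- Reduce(`+`, lapply(seq_len(K), function(j) {
  Xc <- sweep(Xs[y == classes[j], , drop = FALSE], 2, mu_hat[j, ])
  crossprod(Xc)
})) / (n - K)

# delta_k(x) = x' Sigma^{-1} mu_k - 0.5 * mu_k' Sigma^{-1} mu_k + log(pi_k)
Sinv_mu <- solve(Sigma_hat, t(mu_hat))  # p x K
delta <- Xs %*% Sinv_mu -
  matrix(0.5 * colSums(t(mu_hat) * Sinv_mu), n, K, byrow = TRUE) +
  matrix(log(pi_hat), n, K, byrow = TRUE)

# Predict the class with the largest score
pred <- classes[max.col(delta, ties.method = "first")]
mean(pred == y)
table(Predicted = pred, Actual = y)
```

Using solve(Sigma_hat, t(mu_hat)) avoids forming the explicit inverse, and it fails with an error when the covariance estimate is singular, which is the situation the screening step is meant to prevent.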

Question 2: Regression Trees (40 points)

Load data Carseats from the ISLR package. Use the following code to define the training and test sets.

  # load library
  library(ISLR)
  # load data
  data(Carseats)
  
  # set seed
  set.seed(7)
  
  # number of rows in entire dataset
  n_Carseats <- dim(Carseats)[1]
  
  # training set parameters
  train_percentage <- 0.75
  train_size <- floor(train_percentage*n_Carseats)
  train_indices <- sample(x = 1:n_Carseats, size = train_size)
  
  # separate dataset into train and test
  train_Carseats <- Carseats[train_indices,]
  test_Carseats <- Carseats[-train_indices,]

  1. [20 pts] We seek to predict the variable Sales using a regression tree. Load the library rpart. Fit a regression tree to the training set using the rpart() function, leaving all hyperparameter arguments at their defaults. Load the library rpart.plot. Plot the tree using the prp() function. Based on this model, what types of observations have the highest or lowest sales? Use the tree to predict on the test set, then calculate and report the test MSE.

  2. [20 pts] Set the seed to 7 at the beginning of the chunk and do this question in a single chunk so the seed doesn’t get switched. Find the largest complexity parameter value of the tree you grew in part 1 that ensures the cross-validation error < min(cross-validation error) + cross-validation standard deviation. Print that complexity parameter value. Prune the tree using that value. Use the pruned tree to predict on the test set, then calculate and print the test mean squared error.
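
The selection rule in part 2 can be read off the cptable component of an rpart fit, whose columns include CP, xerror, and xstd. The sketch below shows that lookup on the built-in mtcars data as a stand-in, not on Carseats. The names cp_table, threshold, and cp_1se are hypothetical, and the code assumes that "cross-validation standard deviation" means the xstd value at the row with the smallest xerror; if the intended meaning differs, the threshold line would change accordingly.

```r
# Illustration on mtcars (a stand-in dataset); object names are hypothetical.
library(rpart)

set.seed(7)
fit <- rpart(mpg ~ ., data = mtcars)  # default hyperparameters

# cptable has one row per candidate subtree
cp_table <- fit$cptable

# Assumption: the threshold is min(xerror) plus xstd at that same row
best <- which.min(cp_table[, "xerror"])
threshold <- cp_table[best, "xerror"] + cp_table[best, "xstd"]

# Largest CP whose cross-validation error is below the threshold
cp_1se <- max(cp_table[cp_table[, "xerror"] < threshold, "CP"])
print(cp_1se)

# Prune at that CP and compute the mean squared error
# (here on the fitting data, only to keep the example short)
pruned <- prune(fit, cp = cp_1se)
mse <- mean((mtcars$mpg - predict(pruned, newdata = mtcars))^2)
print(mse)
```

Because the cross-validation folds are drawn at random, the xerror and xstd columns depend on the seed, which is why the seed has to be set immediately before the tree is grown.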