Instruction

Please remove this section when submitting your homework.

Students are encouraged to work together on homework and/or utilize advanced AI tools. However, sharing, copying, or providing any part of a homework solution or code to others is an infraction of the University’s rules on Academic Integrity. Any violation will be punished as severely as possible. Final submissions must be uploaded to Gradescope. No email or hard copy will be accepted. For late submission policy and grading rubrics, please refer to the course website.

You are required to submit the rendered file HWx_yourNetID.pdf. For example, HW01_rqzhu.pdf. Please note that this must be a .pdf file. .html format cannot be accepted. Make all of your R code chunks visible for grading.
Include your Name and NetID in the report.
If you use this file or the example homework .Rmd file as a template, be sure to remove this instruction section.
Make sure that you set seed properly so that the results can be replicated if needed.
For some questions, there will be restrictions on what packages/functions you can use. Please read the requirements carefully. As long as the question does not specify s uch restrictions, you can use anything.
When using AI tools, you are encouraged to document your comment on your experience with AI tools especially when it’s difficult for them to grasp the idea of the question.
On random seed and reproducibility: Make sure the version of your R is \(\geq 4.0.0\). This will ensure your random seed generation is the same as everyone else. Please note that updating the R version may require you to reinstall all of your packages.

Question 1: Discriminant Analysis (60 points)

We will be using the first 2500 observations of the MNIST dataset. You can use the following code, or the saved data from our previous homework.

  # inputs to download file
  fileLocation <- "https://pjreddie.com/media/files/mnist_train.csv"
  numRowsToDownload <- 2500
  localFileName <- paste0("mnist_first", numRowsToDownload, ".RData")

  # download the data and add column names
  mnist <- read.csv(fileLocation, nrows = numRowsToDownload)
  numColsMnist <- dim(mnist)[2]
  colnames(mnist) <- c("Digit", paste("Pixel", seq(1:(numColsMnist - 1)), sep = ""))

  # save file
  # in the future we can read in from the local copy instead of having to redownload
  save(mnist, file = localFileName)
  
  # you can load the data with the following code
  load(file = localFileName)

[10 pts] Write you own code to fit a Linear Discriminant Analysis (LDA) model to the MNIST dataset. Use the first 1250 observations as the training set and the remaining observations as the test set. An issue with this dataset is that some pixels display little or no variation across all observations. This zero variance issue poses a problem when inverting the estimated covariance matrix. To address this issue, take digits 1, 7, and 9 from the training data, and perform a screening on the marginal variance of all 784 pixels. Take the top 300 pixels with the largest variance and use them to fit the LDA model. Remove the remaining ones from the training and test data.
[30 pts] Write your own code to implement the LDA model. Remember that LDA requires the estimation of several parameters: \(\Sigma\), \(\mu_k\), and \(\pi_k\). Estimate these parameters and calculate the decision scores \(\delta_k\) on the testing data to predict the class label. Report the accuracy and the confusion matrix based on the testing data.
[10 pts] Use the lda() function from MASS package to fit LDA. Report the accuracy and the confusion matrix based on the testing data. Compare your results with part b.
[10 pts] Use the qda() function from MASS package to fit QDA. Does the code work directly? Why? If you are asked to modify your own code to perform QDA, what would you do? Discuss this issue and propose at least two solutions to address it. If relavent, provide mathematical reasoning (in latex) of your solution. You do not need to implement that with code.

Question 2: Regression Trees (40 points)

Load data Carseats from the ISLR package. Use the following code to define the training and test sets.

  # load library
  library(ISLR)

## Warning: package 'ISLR' was built under R version 4.4.1

  # load data
  data(Carseats)
  
  # set seed
  set.seed(7)
  
  # number of rows in entire dataset
  n_Carseats <- dim(Carseats)[1]
  
  # training set parameters
  train_percentage <- 0.75
  train_size <- floor(train_percentage*n_Carseats)
  train_indices <- sample(x = 1:n_Carseats, size = train_size)
  
  # separate dataset into train and test
  train_Carseats <- Carseats[train_indices,]
  test_Carseats <- Carseats[-train_indices,]

[20 pts] We seek to predict the variable Sales using a regression tree. Load the library rpart. Fit a regression tree to the training set using the rpart() function, all hyperparameter arguments should be left as default. Load the library rpart.plot(). Plot the tree using the prp() function. Based on this model, what type of observations has the highest or lowest sales? Predict using the tree onto the test set, calculate and report the MSE on the testing data.
[20 pts] Set the seed to 7 at the beginning of the chunk and do this question in a single chunk so the seed doesn’t get switched. Find the largest complexity parameter value of the tree you grew in part a) that will ensure that the cross-validation error < min(cross-validation error) + cross-validation standard deviation. Print that complexity parameter value. Prune the tree using that value. Predict using the pruned tree onto the test set, calculate the test Mean-Squared Error, and print it.

Stat 432 Homework 8

Assigned: Oct 14, 2024; Due: 11:59 PM CT, Oct 24, 2024

Instruction

Question 1: Discriminant Analysis (60 points)

Question 2: Regression Trees (40 points)