Instruction

Please remove this section when submitting your homework.

Students are encouraged to work together on homework and/or utilize advanced AI tools. However, there are two basic rules:

Final submissions must be uploaded to Gradescope. No email or hard copy will be accepted. Please refer to the course website for late submission policy and grading rubrics.

Question 1 [65 pts]: Linear Discriminant Analysis

Load the same MNIST dataset used in HW5, in the same way as in the previous HW.

  load("mnist_first2000.RData")
  dim(mnist)
## [1] 2000  785
  a. [25 pts] We aim to fit an LDA (Linear Discriminant Analysis) model with our own function, following our understanding of LDA. An issue with this dataset, as we saw earlier, is that some pixels display little or no variation across all observations. This zero-variance issue poses a problem when inverting the estimated covariance matrix. Do the following to address the issue and fit the LDA model.

    • Within the first 1000 observations, extract digits 1, 4, and 9 as the training dataset. Do the same for the second 1000 observations as the testing dataset.
    • To remove variables (pixels) with low variance, use the same procedure as before to select the top 300 pixels with the highest variance.
    • Following our lecture notes, the LDA model requires estimating several quantities: \(\Sigma\), \(\mu_k\), and \(\pi_k\). Perform these estimations and calculate the linear decision score \(x^T w_k + b_k\) (the explicit formulas are written out below this list).
    • Use the linear decision score to predict the class label on the testing data. Report the prediction error and the confusion matrix based on the testing data.
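
For concreteness, the quantities implemented in the code below take the standard form from the lecture notes: the score for class \(k\) is

\[ f_k(x) = x^T w_k + b_k, \qquad w_k = \Sigma^{-1} \mu_k, \qquad b_k = -\frac{1}{2} \mu_k^T \Sigma^{-1} \mu_k + \log \pi_k, \]

and the predicted class is \(\arg\max_k f_k(x)\).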

Answer:

  # Part (a)
  # extract digits 1, 4, and 9 for the training and testing sets
  train <- mnist[1:1000, ]
  train <- train[train$Digit %in% c(1, 4, 9), ]
  test <- mnist[1001:2000, ]
  test <- test[test$Digit %in% c(1, 4, 9), ]
  
  # variance screening: keep the 300 pixels with the largest variance
  pix_var = apply(train[, -1], 2, var)
  varuse = order(pix_var, decreasing = TRUE)[1:300]
  train = train[, c(1, varuse + 1)]
  test = test[, c(1, varuse + 1)]
  
  # estimate the parameters
  # separate datasets
  digit1 = train[train$Digit == 1, -1]
  digit4 = train[train$Digit == 4, -1]
  digit9 = train[train$Digit == 9, -1]
  
  # prior
  pik = c(nrow(digit1), nrow(digit4), nrow(digit9)) / nrow(train)
  
  # mu
  mu1 = colMeans(digit1)
  mu4 = colMeans(digit4)
  mu9 = colMeans(digit9)
  
  # the centered data
  C1_centered = scale(digit1, center = TRUE, scale = FALSE)
  C4_centered = scale(digit4, center = TRUE, scale = FALSE)
  C9_centered = scale(digit9, center = TRUE, scale = FALSE)
  
  # pooled covariance matrix
  Sigma = (t(C1_centered) %*% C1_centered + 
           t(C4_centered) %*% C4_centered + 
           t(C9_centered) %*% C9_centered) / (nrow(train) - 3)
  
  # invert Sigma once and reuse it
  Sigma_inv = solve(Sigma)
  
  # calculate wk = Sigma^{-1} mu_k
  w1 = Sigma_inv %*% mu1
  w4 = Sigma_inv %*% mu4
  w9 = Sigma_inv %*% mu9
  
  # calculate bk = -mu_k' Sigma^{-1} mu_k / 2 + log(pi_k)
  b1 = - 0.5 * t(mu1) %*% Sigma_inv %*% mu1 + log(pik[1])
  b4 = - 0.5 * t(mu4) %*% Sigma_inv %*% mu4 + log(pik[2])
  b9 = - 0.5 * t(mu9) %*% Sigma_inv %*% mu9 + log(pik[3])
  
  # predict the testing data by comparing the linear scores
  # (equivalent to comparing the posterior probabilities)
  f1 = as.matrix(test[, -1]) %*% w1 + as.numeric(b1)
  f4 = as.matrix(test[, -1]) %*% w4 + as.numeric(b4)
  f9 = as.matrix(test[, -1]) %*% w9 + as.numeric(b9)
  
  # predicted labels and confusion matrix
  pred = c(1, 4, 9)[apply(cbind(f1, f4, f9), 1, which.max)]
  conf_tab = table(pred, test[, 1])
  conf_tab
##     
## pred  1  4  9
##    1 93 22 16
##    4  7 61 35
##    9  4 26 59
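
The question also asks for the prediction error, which can be read directly off the confusion matrix with one extra line:

  # prediction error = off-diagonal proportion of the confusion matrix
  1 - sum(diag(conf_tab)) / sum(conf_tab)
  # from the table above: (323 - 213) / 323, approximately 0.341
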
  b. [20 pts] The result may not be ideal; compared with the one-vs-one SVM model from the previous HW, it is probably worse. Let's try to improve it. One issue could be that the inverse of the covariance matrix is not very stable. As we discussed in class, one possible remedy is to add a ridge penalty to \(\Sigma\), i.e., to use \(\Sigma + \lambda I\) in its place. Carry out this approach using \(\lambda = 1\) and re-calculate the confusion matrix and prediction error. Then try a few different penalty values of \(\lambda\) to observe how the prediction error changes. Comment on the effect of \(\lambda\), specifically in the context of this model. Is this related to the bias-variance trade-off?

Answer:

  # Part (b): lambda = 1
  Sigma_pen = Sigma + diag(nrow(Sigma)) * 1
  
  # calculate wk and bk using the penalized covariance (invert once)
  Sigma_pen_inv = solve(Sigma_pen)
  w1 = Sigma_pen_inv %*% mu1
  w4 = Sigma_pen_inv %*% mu4
  w9 = Sigma_pen_inv %*% mu9
  
  b1 = - 0.5 * t(mu1) %*% Sigma_pen_inv %*% mu1 + log(pik[1])
  b4 = - 0.5 * t(mu4) %*% Sigma_pen_inv %*% mu4 + log(pik[2])
  b9 = - 0.5 * t(mu9) %*% Sigma_pen_inv %*% mu9 + log(pik[3])
  
  # predict the testing data by comparing the linear scores
  f1 = as.matrix(test[, -1]) %*% w1 + as.numeric(b1)
  f4 = as.matrix(test[, -1]) %*% w4 + as.numeric(b4)
  f9 = as.matrix(test[, -1]) %*% w9 + as.numeric(b9)
  
  # predicted labels and confusion matrix
  pred = c(1, 4, 9)[apply(cbind(f1, f4, f9), 1, which.max)]
  conf_tab = table(pred, test[, 1])
  conf_tab
##     
## pred  1  4  9
##    1 98 10  9
##    4  3 70 22
##    9  3 29 79
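
Similarly, the prediction error with \(\lambda = 1\):

  # prediction error with lambda = 1: again the off-diagonal proportion
  1 - sum(diag(conf_tab)) / sum(conf_tab)
  # from the table above: (323 - 247) / 323, approximately 0.235
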
  # Part (b): lambda = 10
  Sigma_pen = Sigma + diag(nrow(Sigma)) * 10
  
  # calculate wk and bk using the penalized covariance (invert once)
  Sigma_pen_inv = solve(Sigma_pen)
  w1 = Sigma_pen_inv %*% mu1
  w4 = Sigma_pen_inv %*% mu4
  w9 = Sigma_pen_inv %*% mu9
  
  b1 = - 0.5 * t(mu1) %*% Sigma_pen_inv %*% mu1 + log(pik[1])
  b4 = - 0.5 * t(mu4) %*% Sigma_pen_inv %*% mu4 + log(pik[2])
  b9 = - 0.5 * t(mu9) %*% Sigma_pen_inv %*% mu9 + log(pik[3])
  
  # predict the testing data by comparing the linear scores
  f1 = as.matrix(test[, -1]) %*% w1 + as.numeric(b1)
  f4 = as.matrix(test[, -1]) %*% w4 + as.numeric(b4)
  f9 = as.matrix(test[, -1]) %*% w9 + as.numeric(b9)
  
  # predicted labels and confusion matrix
  pred = c(1, 4, 9)[apply(cbind(f1, f4, f9), 1, which.max)]
  conf_tab = table(pred, test[, 1])
  conf_tab
##     
## pred   1   4   9
##    1 100   2   1
##    4   2  89  22
##    9   2  18  87
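
With \(\lambda = 10\) the prediction error drops to \((323 - 276)/323 \approx 0.146\). To try a few more penalty values without repeating the whole block, the fit-and-predict steps can be wrapped in a function. This is a minimal sketch (it assumes Sigma, mu1, mu4, mu9, pik, and test from part a. are still in scope, and the grid of \(\lambda\) values is arbitrary):

  # sketch: testing error of the ridge-penalized LDA as a function of lambda
  lda_ridge_err = function(lambda) {
    S_pen = Sigma + lambda * diag(nrow(Sigma))
    Mu = cbind(mu1, mu4, mu9)
    W = solve(S_pen, Mu)                    # columns are w_k = S_pen^{-1} mu_k
    b = -0.5 * colSums(Mu * W) + log(pik)   # b_k for each class
    scores = sweep(as.matrix(test[, -1]) %*% W, 2, b, "+")
    pred = c(1, 4, 9)[max.col(scores)]
    mean(pred != test$Digit)
  }
  sapply(c(0.1, 1, 5, 10, 50, 100), lda_ridge_err)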

\(\lambda\) controls a bias-variance trade-off. When \(\lambda\) is small, the covariance matrix is estimated with little bias (if the true model is multivariate normal), but its inverse is unstable, so the fitted classifier has high variance. When \(\lambda\) is large, \(\Sigma + \lambda I\) is biased toward a multiple of the identity, but the resulting classifier is much more stable (low variance). In this case the model with \(\lambda = 10\) (error \(\approx 0.146\)) is better than the model with \(\lambda = 1\) (error \(\approx 0.235\)), and both clearly improve on the unpenalized fit from part a. (error \(\approx 0.341\)): with 300 pixels and only a few hundred training observations, the variance reduction dominates.

  c. [20 pts] Another approach is to perform PCA at the very beginning of the analysis, instead of screening for the top 300 variables, and then carry out the same type of analysis as in part a. with the PCs as the variables (in both the training and testing data).

    • Start with the original complete mnist data, and take digits 1, 4, and 9. Perform PCA on the pixels.
    • Take the first 50 PCs as your new dataset, then split the data back into training and testing sets with the same sample sizes as in part a.
    • Perform the same analysis as in part a. and report the confusion matrix and prediction error.

Comment on why you think this approach works well.

Answer:

  # Part (c)
  # take digits 1, 4, and 9 from the complete data and get the first 50 PCs
  # (PCA is fit on the full data, train and test together, as instructed)
  mnist149 = mnist[mnist$Digit %in% c(1, 4, 9), ]
  mnist_pca = prcomp(mnist149[, -1], center = TRUE, scale = FALSE)$x[, 1:50]
  
  # split back into training and testing sets; subsetting preserves row
  # order, so the first nrow(train) rows are the training observations
  pca_trainx = mnist_pca[1:nrow(train), ]
  pca_testx = mnist_pca[-(1:nrow(train)), ]
  
  # estimate the parameters
  # separate datasets
  digit1 = pca_trainx[train$Digit == 1, ]
  digit4 = pca_trainx[train$Digit == 4, ]
  digit9 = pca_trainx[train$Digit == 9, ]
  
  # prior
  pik = c(nrow(digit1), nrow(digit4), nrow(digit9)) / nrow(train)
  
  # mu
  mu1 = colMeans(digit1)
  mu4 = colMeans(digit4)
  mu9 = colMeans(digit9)
  
  # the centered data
  C1_centered = scale(digit1, center = TRUE, scale = FALSE)
  C4_centered = scale(digit4, center = TRUE, scale = FALSE)
  C9_centered = scale(digit9, center = TRUE, scale = FALSE)
  
  # pooled covariance matrix
  Sigma = (t(C1_centered) %*% C1_centered + 
           t(C4_centered) %*% C4_centered + 
           t(C9_centered) %*% C9_centered) / (nrow(train) - 3)
  
  # invert Sigma once and reuse it
  Sigma_inv = solve(Sigma)
  
  # calculate wk
  w1 = Sigma_inv %*% mu1
  w4 = Sigma_inv %*% mu4
  w9 = Sigma_inv %*% mu9
  
  # calculate bk
  b1 = - 0.5 * t(mu1) %*% Sigma_inv %*% mu1 + log(pik[1])
  b4 = - 0.5 * t(mu4) %*% Sigma_inv %*% mu4 + log(pik[2])
  b9 = - 0.5 * t(mu9) %*% Sigma_inv %*% mu9 + log(pik[3])
  
  # predict the testing data by comparing the linear scores
  f1 = as.matrix(pca_testx) %*% w1 + as.numeric(b1)
  f4 = as.matrix(pca_testx) %*% w4 + as.numeric(b4)
  f9 = as.matrix(pca_testx) %*% w9 + as.numeric(b9)
  
  # predicted labels and confusion matrix
  pred = c(1, 4, 9)[apply(cbind(f1, f4, f9), 1, which.max)]
  conf_tab = table(pred, test$Digit)
  conf_tab
##     
## pred   1   4   9
##    1 103   3   0
##    4   1 101   6
##    9   0   5 104
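
The corresponding prediction error:

  # prediction error of the 50-PC LDA
  1 - sum(diag(conf_tab)) / sum(conf_tab)
  # from the table above: (323 - 308) / 323, approximately 0.046

This approach works well for a few reasons. PCA uses information from all pixels rather than discarding everything outside the 300 highest-variance ones, so less signal is thrown away. The leading PCs are (by construction) uncorrelated over the data PCA was fit on, so the pooled covariance of the 50 scores is well conditioned and its inverse is stable without any ridge penalty. Finally, a \(50 \times 50\) covariance matrix has far fewer parameters to estimate than a \(300 \times 300\) one, which lowers the variance of the fitted LDA given the modest training sample size.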

Question 2 [35 pts]: Classification Trees

We will use the same PCA data you constructed in Question 1(c). Use the training data for model fitting and the testing data for model evaluation. Fit a CART model with 5-fold cross-validation and answer the following. There are many packages that can do this; you could consider the rpart package, which provides cross-validation functionality and easy plotting. Do not print excessive output when using this package.

Answer:

  library(rpart)
  library(rpart.plot)

  # note: the cross-validation folds are random, so calling set.seed()
  # beforehand would make the cptable below reproducible
  cart.fit <- rpart(as.factor(Digit) ~ ., 
                    data = data.frame(Digit = as.factor(train$Digit), pca_trainx),
                    control = rpart.control(xval = 5))
  
  # get the cross-validation results
  cart.fit$cptable
##           CP nsplit rel error    xerror       xstd
## 1 0.49268293      0 1.0000000 1.0682927 0.04069263
## 2 0.22439024      1 0.5073171 0.5170732 0.04110227
## 3 0.03902439      2 0.2829268 0.3121951 0.03491807
## 4 0.01463415      4 0.2048780 0.2780488 0.03339897
## 5 0.01000000      9 0.1317073 0.2878049 0.03385090
  plotcp(cart.fit)

  # the minimum cross-validation error occurs at row 4 (4 splits), and the
  # 1-SE rule selects the same tree; prune with the geometric mean of the
  # adjacent CP values so that this tree is retained
  cptarg = sqrt(cart.fit$cptable[4, 1] * cart.fit$cptable[5, 1])
  prunedtree = prune(cart.fit, cp = cptarg)
  
  # the pruned tree has 5 terminal nodes
  rpart.plot(prunedtree)

  # apply on the testing data
  pred = predict(prunedtree, newdata = data.frame(pca_testx), type = "class")
  conf_tab = table(pred, test$Digit)
  conf_tab
##     
## pred  1  4  9
##    1 96  2  0
##    4  0 68 20
##    9  8 39 90
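
The corresponding testing error of the pruned tree:

  # prediction error of the pruned CART model
  1 - sum(diag(conf_tab)) / sum(conf_tab)
  # from the table above: (323 - 254) / 323, approximately 0.214

This is noticeably worse than the LDA fit on the same 50 PCs in Question 1(c), which is not surprising: a small tree makes axis-aligned splits on a handful of PC scores, a much cruder decision boundary than LDA's linear combination of all 50 scores.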