Instruction

Please remove this section when submitting your homework.

Students are encouraged to work together on homework and/or utilize advanced AI tools. However, sharing, copying, or providing any part of a homework solution or code to others is an infraction of the University’s rules on Academic Integrity. Any violation will be punished as severely as possible. Final submissions must be uploaded to Gradescope. No email or hard copy will be accepted. For late submission policy and grading rubrics, please refer to the course website.

Question 1: Tuning Random Forests

In our lecture, we mainly used the RLT package and the randomForest package. However, there are many other packages for random forests, for example, ranger, grf, randomForestSRC, etc. Let's use the randomForestSRC package for this task, a multi-class classification problem. First, process the MNIST data with the same PCA approach as in the previous HW:

* Start with the original complete `mnist` data with 2000 observations, and take digits 1, 6, and 7. Perform PCA on the pixels.
* Take the first 20 PCs as your new dataset, and then split the data back into training and testing sets, with the same sample sizes you were supposed to have in HW8.

Then, complete the following tasks using the randomForestSRC package:

* Use the `rfsrc` function to fit a random forest model with the default settings. Report the OOB error rate.
* Construct a grid of `mtry` and `nodesize` values to tune the model. Use the `rfsrc` function to fit the model for each grid point on the training data and select the best tuning. Report the OOB error rate for each model and report the results (anything you think is necessary) of the best model. Make sure to also calculate and present the variable importance. Comment on your results.
* Apply this model to the testing data and report the testing error. 

Be careful that different packages use different notations for their parameters. It is strongly recommended to check the rfsrc documentation to see what parameters you need to specify and what results you should extract (!!!) from the fitted model so that your assessment of the results is correct.
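
Below is a minimal sketch on the built-in iris data (not the MNIST data), only to illustrate which components of a fitted classification `rfsrc` object hold the OOB results and the variable importance; the solution that follows reads the same components from the forest fitted on the MNIST PCs.

  library(randomForestSRC)
  # toy classification forest on iris, only to show where the results live
  fit = rfsrc(Species ~ ., data = iris, ntree = 100, importance = TRUE)
  tail(fit$err.rate, 1)   # final cumulative OOB error (overall and per class)
  head(fit$class.oob)     # OOB predicted class labels
  fit$importance          # variable importance: one column per class plus "all"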

  # Load the data
  load("mnist_first2000.RData")
  dim(mnist)
## [1] 2000  785
  # keep digits 1, 6, and 7 and take the first 20 PCs
  mnist167 = mnist[mnist$Digit == 1 | mnist$Digit == 6 | mnist$Digit == 7, ]
  mnist_pca = prcomp(mnist167[, -1], center = TRUE, scale = FALSE)$x[, 1:20]

  # training size: number of 1/6/7 digits among the first 1000 (training) observations
  digits = mnist$Digit[1:1000]
  digits = digits[digits == 1 | digits == 6 | digits == 7]
  ntrain = length(digits)

  # split the data
  mnist167 = data.frame("Digit" = as.factor(mnist167$Digit), mnist_pca)
  train = mnist167[1:ntrain, ]
  test = mnist167[-c(1:ntrain), ]

  # fit the model
  library(randomForestSRC)
## 
##  randomForestSRC 3.2.2 
##  
##  Type rfsrc.news() to see new features, changes, and bug fixes. 
## 
  set.seed(432)
  rfsrc.fit = rfsrc(Digit ~ ., data = train, ntree = 1000)

  # get oob prediction error
  mean(rfsrc.fit$class.oob != train$Digit)
## [1] 0.03669725
  table(rfsrc.fit$class.oob, train$Digit)
##    
##       1   6   7
##   1 111   1   4
##   6   4  92   1
##   7   1   1 112
  # create a grid of mtry and nodesize
  grid = expand.grid(mtry = c(5, 10, 20), nodesize = c(1, 5, 10))

  # fit the model with the grid
  set.seed(432)
  allerrors = rep(NA, nrow(grid))

  for (i in 1:nrow(grid)) {
    rfsrc.fit = rfsrc(Digit ~ ., data = train, ntree = 1000, 
                      mtry = grid$mtry[i], nodesize = grid$nodesize[i])

    allerrors[i] = mean(rfsrc.fit$class.oob != train$Digit)
  }

  cbind(grid, allerrors)
  # get the best tuning
  best = grid[which.min(allerrors), ]
  
  # get the best model
  rfsrc.fit = rfsrc(Digit ~ ., data = train, ntree = 1000, 
                    mtry = best$mtry, nodesize = best$nodesize,
                    importance = TRUE)

  # apply to the testing data
  pred = predict(rfsrc.fit, test)$class
  mean(pred != test$Digit)
## [1] 0.01892744
  table(pred, test$Digit)
##     
## pred   1   6   7
##    1 102   0   0
##    6   0 106   4
##    7   2   0 103
  # overall variable importance (the "all" column)
  barplot(rfsrc.fit$importance[, 1])

The results show that the random forest fits well, with the first two PCs being by far the most important variables. This suggests that most of the variation in the pixels is also associated with the digit labels.
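
As an optional check on this interpretation, the proportion of pixel variance captured by the first 20 PCs can be read off the PCA fit. The sketch below recomputes the PCA on the digit 1/6/7 pixels, since `mnist167` was overwritten with the PC scores earlier.

  # optional: proportion of pixel variance captured by the first 20 PCs
  # (recompute the PCA since mnist167 now holds the PC scores)
  pixels167 = mnist[mnist$Digit %in% c(1, 6, 7), -1]
  pca167 = prcomp(pixels167, center = TRUE, scale = FALSE)
  sum(pca167$sdev[1:20]^2) / sum(pca167$sdev^2)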

Question 2: Using xgboost for MNIST

  1. Use the xgboost package to fit a multi-class classification model to the MNIST data. For this question, use all the data in your 2000-observation MNIST dataset: all ten digits, no PCA needed. You should specify the following: the `multi:softmax` objective with `num_class = 10`, the `gbtree` booster, `eta = 0.5`, `max_depth = 2`, and `nrounds = 50`.

Report the testing error rate and the confusion matrix.

  library(xgboost)

  # define data
  traindata = mnist[1:1000, ]
  testdata = mnist[1001:2000, ]

  # Train the XGBoost model
  xgb_model <- xgboost(data = data.matrix(traindata[, -1]), label = traindata$Digit,
                       objective = "multi:softmax",
                       num_class = 10, 
                       booster = "gbtree", # Tree booster
                       eta = 0.5, # Learning rate
                       max_depth = 2, # Maximum depth of trees
                       nrounds = 50,
                       verbose = 0)

  # Evaluate the model on the test set
  predictions <- predict(xgb_model, newdata = data.matrix(testdata[, -1]))
  mean(predictions != testdata$Digit)
## [1] 0.144
  table(predictions, testdata$Digit)
##            
## predictions  0  1  2  3  4  5  6  7  8  9
##           0 87  0  2  1  0  0  2  0  0  2
##           1  0 92  1  1  1  1  0  1  3  0
##           2  0  0 80  8  2  0  2  0  3  1
##           3  0  0  4 73  1  2  0  2  2  1
##           4  0  0  3  0 88  0  1  3  0  7
##           5  1  0  0  9  1 75  1  1  1  0
##           6  0  0  2  1  0  1 99  0  0  1
##           7  1  1  1  0  2  1  0 96  0  5
##           8  4 11  5  2  4  5  1  1 73  0
##           9  0  0  1  3 10  4  0  3  3 93

  2. The model fits 50 rounds (trees) sequentially. However, you can produce predictions using only the first few trees. This is controlled with the `iterationrange` argument of the `predict()` function. Plot your prediction error vs. the number of trees. Comment on your results.
  # get the prediction errors
  errors = rep(NA, 50)
  for (i in 1:50) {
    # use only the first i boosting rounds for this prediction
    predictions <- predict(xgb_model, newdata = data.matrix(testdata[, -1]), iterationrange = c(1, i+1))
    errors[i] = mean(predictions != testdata$Digit)
  }

  plot(errors, type = "l", xlab = "Number of trees", ylab = "Testing error")
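
As an aside, an alternative to the prediction loop above is to let xgboost track the per-round test error itself: `xgb.train()` accepts a `watchlist` (named `evals` in newer xgboost releases) and records the chosen `eval_metric` for every round in `evaluation_log`. A minimal sketch, assuming the same `traindata`/`testdata` split and the default settings from part 1:

  # sketch: record the per-round multi-class test error during training
  dtrain = xgb.DMatrix(data.matrix(traindata[, -1]), label = traindata$Digit)
  dtest = xgb.DMatrix(data.matrix(testdata[, -1]), label = testdata$Digit)

  xgb_fit = xgb.train(params = list(objective = "multi:softmax", num_class = 10,
                                    eta = 0.5, max_depth = 2, eval_metric = "merror"),
                      data = dtrain, nrounds = 50,
                      watchlist = list(test = dtest), verbose = 0)

  # evaluation_log holds the "merror" (misclassification rate) after each round
  plot(xgb_fit$evaluation_log$test_merror, type = "l",
       xlab = "Number of trees", ylab = "Testing error")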


  3. Tune the `eta` and `max_depth` parameters to see if you can improve the performance.
  tunegrid = expand.grid(eta = c(0.1, 0.5, 1), max_depth = c(2, 4, 6))
  allerrors = rep(NA, nrow(tunegrid))
  allnrounds = rep(NA, nrow(tunegrid))

  for (i in 1:nrow(tunegrid)) {
    xgb_model <- xgboost(data = data.matrix(traindata[, -1]), label = traindata$Digit,
                         objective = "multi:softmax",
                         num_class = 10, 
                         booster = "gbtree", # Tree booster
                         eta = tunegrid$eta[i], # Learning rate
                         max_depth = tunegrid$max_depth[i], # Maximum depth of trees
                         nrounds = 50,
                         verbose = 0)

    nrounds_pred = rep(NA, 50)
    for (k in 1:50)
    {
      predictions <- predict(xgb_model, newdata = data.matrix(testdata[, -1]), iterationrange = c(1, k+1))
      nrounds_pred[k] = mean(predictions != testdata$Digit)
    }

    allerrors[i] = min(nrounds_pred)
    allnrounds[i] = which.min(nrounds_pred)
  }

  cbind(tunegrid, allnrounds, allerrors)
  best = which.min(allerrors)
  cbind(tunegrid, allnrounds, allerrors)[best, ]
  # Train the XGBoost model
  xgb_model <- xgboost(data = data.matrix(traindata[, -1]), label = traindata$Digit,
                       objective = "multi:softmax",
                       num_class = 10, 
                       booster = "gbtree", # Tree booster
                       eta = tunegrid$eta[best], # Learning rate
                       max_depth = tunegrid$max_depth[best], # Maximum depth of trees
                       nrounds = allnrounds[best],
                       verbose = 0)

  predictions <- predict(xgb_model, newdata = data.matrix(testdata[, -1]))

  mean(predictions != testdata$Digit)
## [1] 0.142
  table(predictions, testdata$Digit)
##            
## predictions  0  1  2  3  4  5  6  7  8  9
##           0 87  0  2  1  0  0  2  0  0  2
##           1  0 92  1  1  1  1  0  1  3  0
##           2  0  0 80  8  2  0  2  0  3  1
##           3  0  0  4 72  2  2  0  1  2  1
##           4  0  0  3  0 89  0  1  3  0  6
##           5  1  0  0 10  0 75  1  1  1  0
##           6  0  0  2  1  0  1 99  0  0  1
##           7  1  1  1  0  2  1  0 97  0  5
##           8  4 11  5  2  4  5  1  1 73  0
##           9  0  0  1  3  9  4  0  3  3 94