Instruction

Please remove this section when submitting your homework.

Students are encouraged to work together on homework and/or utilize advanced AI tools. However, sharing, copying, or providing any part of a homework solution or code to others is an infraction of the University’s rules on Academic Integrity. Any violation will be punished as severely as possible. Final submissions must be uploaded to Gradescope. No email or hard copy will be accepted. For late submission policy and grading rubrics, please refer to the course website.

Question 1: Tuning Random Forests

In our lecture, we mainly used the RLT package and the randomForest package. However, there are many other packages for random forests, for example, ranger, grf, randomForestSRC, etc. Let's consider using the randomForestSRC package for a multi-class classification problem. First, process the MNIST data using the PCA approach from the previous HW:
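
A minimal sketch of this preprocessing, assuming the 2000-sample MNIST pixel matrix is loaded as mnist.x with labels mnist.y (these object names, the split, and the number of retained components are placeholders, not given in the assignment):

```r
set.seed(1)

# example train/test split (placeholder: reuse the split from your previous HW)
train.id <- sample(1:nrow(mnist.x), nrow(mnist.x) / 2)

# PCA fitted on the training pixels only, then both sets projected
pc <- prcomp(mnist.x[train.id, ], center = TRUE)
k  <- 50                                   # number of PCs kept (placeholder)

train.x <- pc$x[, 1:k]
test.x  <- predict(pc, mnist.x[-train.id, ])[, 1:k]
train.y <- factor(mnist.y[train.id])
test.y  <- factor(mnist.y[-train.id])
```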

Then, complete the following tasks using the randomForestSRC package:

Be careful that different packages use different notations for their parameters. It is strongly recommended to check the documentation of the rfsrc() function to see which parameters you need to specify and which results you should extract (!!!) from the fitted model so that you have a correct assessment of the results.
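
As a starting point, here is a minimal sketch of fitting and evaluating an rfsrc() classification forest, assuming the placeholder objects train.x, train.y, test.x, test.y from the PCA sketch above; the tuning values shown are assumptions, not prescribed:

```r
library(randomForestSRC)

train.data <- data.frame(y = train.y, train.x)

# rfsrc() detects a classification forest from the factor response;
# ntree, mtry, and nodesize below are placeholders to be tuned
fit <- rfsrc(y ~ ., data = train.data,
             ntree = 500, mtry = 10, nodesize = 5)

fit$err.rate[fit$ntree, 1]       # overall OOB misclassification rate

# testing error from the predicted class labels
pred <- predict(fit, newdata = data.frame(test.x))
mean(pred$class != test.y)
```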

Question 2: Using xgboost for MNIST

  1. [20 Points] Use the xgboost package to fit a multi-class classification model to the MNIST data. For this question, you should use all the data in your 2000-sample MNIST dataset, with all digits; no PCA is needed. You should specify the following:

    • Use multi:softmax as the objective function so that it can handle multi-class classification.
    • Use num_class = 10 to specify the number of classes.
    • Choose the correct base learner.
    • Set these parameters:
      • The learning rate eta = 0.5
      • The maximum depth of trees max_depth = 2
      • The number of trees nrounds = 50

Report the testing error rate and the confusion matrix.
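
A minimal fitting sketch, assuming the classic xgboost() interface from the 1.x CRAN releases and the placeholder objects mnist.x, mnist.y, and train.id from the Question 1 sketch (for multi:softmax the labels must be coded as integers 0 through 9):

```r
library(xgboost)

# labels for multi:softmax must be integers 0, ..., num_class - 1
xgb.fit <- xgboost(data = mnist.x[train.id, ], label = mnist.y[train.id],
                   objective = "multi:softmax", num_class = 10,
                   booster = "gbtree",            # tree base learner
                   eta = 0.5, max_depth = 2, nrounds = 50, verbose = 0)

xgb.pred <- predict(xgb.fit, mnist.x[-train.id, ])

mean(xgb.pred != mnist.y[-train.id])                      # testing error rate
table(Predicted = xgb.pred, Truth = mnist.y[-train.id])   # confusion matrix
```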

  2. [20 Points] The model fits 50 rounds (trees) sequentially. However, you can produce your prediction using only a limited number of trees. This can be controlled using the iterationrange argument of the predict() function. Plot your prediction error vs. the number of trees, and comment on your results.
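
One way to produce this plot, assuming the xgb.fit model and placeholder objects from the sketch above; per the predict.xgb.Booster documentation, iterationrange uses a base-1 start and an exclusive end, so c(1, k + 1) keeps the first k trees:

```r
# testing error as a function of the number of trees used in prediction
err <- sapply(1:50, function(k) {
  pred <- predict(xgb.fit, mnist.x[-train.id, ], iterationrange = c(1, k + 1))
  mean(pred != mnist.y[-train.id])
})

plot(1:50, err, type = "l",
     xlab = "Number of trees", ylab = "Testing error rate")
```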

  3. [20 Points] Tune the parameters eta and max_depth to see if you can improve the performance.

    • For computational efficiency, consider just three values of eta and three values of max_depth.
    • For each tuning combination of eta and max_depth, obtain the best number of trees (via iterationrange) for predicting the testing data. This is not a cross-validation; it is just predicting the testing data. The way to specify iterationrange can be found on page 17 of https://cran.r-project.org/web/packages/xgboost/xgboost.pdf
    • Report the best tuning of eta and max_depth and the corresponding testing error.
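
A sketch of this grid search under the same placeholder objects as above; the candidate values of eta and max_depth below are illustrative, not prescribed by the assignment:

```r
# illustrative 3 x 3 grid of candidate values
tune.grid <- expand.grid(eta = c(0.1, 0.3, 0.5), max_depth = c(2, 4, 6))
tune.grid$best.ntrees <- NA
tune.grid$best.err    <- NA

for (i in 1:nrow(tune.grid)) {
  fit <- xgboost(data = mnist.x[train.id, ], label = mnist.y[train.id],
                 objective = "multi:softmax", num_class = 10,
                 eta = tune.grid$eta[i], max_depth = tune.grid$max_depth[i],
                 nrounds = 50, verbose = 0)

  # testing error for every possible number of trees
  err <- sapply(1:50, function(k) {
    pred <- predict(fit, mnist.x[-train.id, ], iterationrange = c(1, k + 1))
    mean(pred != mnist.y[-train.id])
  })

  tune.grid$best.ntrees[i] <- which.min(err)
  tune.grid$best.err[i]    <- min(err)
}

tune.grid[which.min(tune.grid$best.err), ]  # best combination and its error
```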