Instructions

Please remove this section when submitting your homework.

Students are encouraged to work together on homework and/or utilize advanced AI tools. However, there are two basic rules:

Final submissions must be uploaded to Gradescope. No email or hard copy will be accepted. Please refer to the course website for late submission policy and grading rubrics.

Question 1: [50 pts] Fitting and Tuning Random Forests

The goal of this exercise is to help you learn to read package documentation and correctly use a faster implementation of random forests. The original randomForest package is relatively slow. The ranger package provides a faster alternative, but some parameter names differ.

Carefully read the ranger documentation to identify the parameter names corresponding to mtry, nodesize and other specifications of the model (used in our lectures). Then complete the following tasks:

a). [5 pts] Load the Cleveland Heart Disease dataset (processed_cleveland.csv). You can download it from our course website. Recode the outcome num so that num > 0 is labeled 1, and 0 otherwise.
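A minimal sketch of this step, assuming processed_cleveland.csv sits in your working directory and uses "?" to mark missing values, as in the UCI release:

```r
# Read the data; keep character columns as-is so "?" entries survive
heart <- read.csv("processed_cleveland.csv", stringsAsFactors = FALSE)

# Recode the outcome: any num > 0 becomes 1, otherwise 0
heart$num <- as.factor(ifelse(heart$num > 0, 1, 0))
table(heart$num)
```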

b). [5 pts] Remove any observations where ca or thal equals "?", and convert these variables to factors.
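One way to do this, assuming ca and thal were read in as character vectors (e.g., with stringsAsFactors = FALSE):

```r
# Drop rows where ca or thal is "?", then convert both columns to factors
heart <- heart[heart$ca != "?" & heart$thal != "?", ]
heart$ca   <- as.factor(heart$ca)
heart$thal <- as.factor(heart$thal)
```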

c). [10 pts] Fit random forests using the ranger() function with:

Report the training error by predicting the training data on the fitted model.
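A sketch of the fit and the training-error calculation; the hyperparameter values shown are illustrative placeholders for whatever the assignment specifies. Note that ranger uses min.node.size where randomForest uses nodesize:

```r
library(ranger)

fit <- ranger(num ~ ., data = heart,
              num.trees = 500,        # illustrative value
              mtry = 3,               # illustrative value
              min.node.size = 5)      # illustrative value

# Training error: predict the training data with the fitted model
pred_train <- predict(fit, data = heart)$predictions
mean(pred_train != heart$num)
```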

d). [10 pts] Report the out-of-bag (OOB) prediction error on the training data and explain how it differs from the training error reported in (c). Which error is smaller, and which one should we rely on? Why?

e). [15 pts] Perform a grid search to tune the mtry and min.node.size parameters. Use the following grid:
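A grid-search sketch using OOB error to pick the best combination; the grid values below are hypothetical and should be replaced by the grid given in the assignment:

```r
library(ranger)

mtry_grid <- c(2, 4, 6)          # hypothetical values
node_grid <- c(1, 5, 10, 20)     # hypothetical values

results <- expand.grid(mtry = mtry_grid, min.node.size = node_grid)
results$oob_error <- NA

for (i in seq_len(nrow(results))) {
  fit <- ranger(num ~ ., data = heart,
                mtry = results$mtry[i],
                min.node.size = results$min.node.size[i])
  results$oob_error[i] <- fit$prediction.error  # OOB error reported by ranger
}

results[which.min(results$oob_error), ]
```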

f). [5 pts] The ranger package handles categorical variables differently than the randomForest package by default. This is mainly due to a mechanism it implements through the respect.unordered.factors parameter. Read the documentation for this parameter carefully and explain what it does. Discuss the pros and cons of using respect.unordered.factors = "partition" when fitting random forests with categorical variables.
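To see the effect empirically, you can compare the default handling ("ignore", which treats factor levels by their integer codes) with "partition" on the same data; a sketch:

```r
library(ranger)

fit_default   <- ranger(num ~ ., data = heart)  # default: "ignore"
fit_partition <- ranger(num ~ ., data = heart,
                        respect.unordered.factors = "partition")

c(default   = fit_default$prediction.error,
  partition = fit_partition$prediction.error)
```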

Question 2: [20 pts] Effect of Node Size in Random Forests

In this question, we will conduct a small simulation study to analyze the effect of the nodesize parameter in random forests. This is a regression problem, and you should still use the ranger package to complete this question. Do the following:

a). [15 pts] Set up a simulation study to evaluate the prediction performance of random forests. Refer to our previous homework for simulation ideas. Use the following settings:

  library(MASS)
  n = 300
  set.seed(546)
  
  # fix this set of testing data for all simulations
  testx = mvrnorm(n, c(0, 0), matrix(c(1, 0.5, 0.5, 1), 2, 2))  
  
  # generate training data within each simulation
  trainx = mvrnorm(n, c(0, 0), matrix(c(1, 0.5, 0.5, 1), 2, 2))
  trainy = rnorm(n, mean = trainx[,1] + trainx[,2])
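The simulation loop can be sketched as follows, continuing from the setup above. The number of repetitions and the nodesize grid are illustrative; note that the true mean function at the test points is testx[,1] + testx[,2]:

```r
library(MASS)
library(ranger)

nsim <- 100                       # illustrative number of repetitions
node_grid <- c(1, 5, 10, 20, 40)  # illustrative nodesize grid
truth <- testx[, 1] + testx[, 2]  # true mean function at the test points

bias2 <- vars <- numeric(length(node_grid))
for (j in seq_along(node_grid)) {
  preds <- matrix(NA, nsim, n)
  for (s in 1:nsim) {
    # regenerate training data in each repetition
    trainx <- mvrnorm(n, c(0, 0), matrix(c(1, 0.5, 0.5, 1), 2, 2))
    trainy <- rnorm(n, mean = trainx[, 1] + trainx[, 2])
    fit <- ranger(y ~ ., data = data.frame(trainx, y = trainy),
                  min.node.size = node_grid[j])
    preds[s, ] <- predict(fit, data = data.frame(testx))$predictions
  }
  avg_pred <- colMeans(preds)
  bias2[j] <- mean((avg_pred - truth)^2)       # averaged squared bias
  vars[j]  <- mean(apply(preds, 2, var))       # averaged variance
}
```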

b). [5 pts] Plot the averaged Bias\(^2\) and Variance against the grid of nodesize values. Provide a brief discussion of the results in terms of how nodesize affects the Bias-Variance trade-off in random forests.

Question 3: [30 pts] Using xgboost for MNIST

In this question, we will use the xgboost package to perform multi-class classification on the MNIST dataset. The data can be obtained from HW5. You should use the first 1000 observations as the training data and the rest as the testing data, including all digits (0–9); no PCA is required.

a). [10 pts] Use the xgboost function to fit the MNIST training data. Specify the following and report the testing error rate and the confusion matrix.
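A sketch of the fit, assuming the MNIST data is stored as a matrix mnist with the label in column 1 (adapt to however your HW5 data is stored); the settings shown are illustrative:

```r
library(xgboost)

train_x <- as.matrix(mnist[1:1000, -1])
train_y <- mnist[1:1000, 1]
test_x  <- as.matrix(mnist[-(1:1000), -1])
test_y  <- mnist[-(1:1000), 1]

fit <- xgboost(data = train_x, label = train_y,
               nrounds = 50,
               objective = "multi:softmax",  # returns predicted class labels
               num_class = 10,
               verbose = 0)

pred <- predict(fit, test_x)
mean(pred != test_y)        # testing error rate
table(pred, test_y)         # confusion matrix
```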

b). [15 pts] The model fits 50 trees sequentially. However, you can produce predictions using only a limited number of trees. This can be controlled using the iterationrange argument of the predict() function. Plot your prediction error vs. the number of trees and comment on your results. The way to specify iterationrange is described on page 17 of https://cran.r-project.org/web/packages/xgboost/xgboost.pdf
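A sketch of the error-vs-trees plot, assuming fit, test_x, and test_y from part (a). Per the linked manual, iterationrange = c(1, k + 1) uses the first k iterations:

```r
errs <- sapply(1:50, function(k) {
  pred_k <- predict(fit, test_x, iterationrange = c(1, k + 1))
  mean(pred_k != test_y)
})
plot(1:50, errs, type = "b",
     xlab = "number of trees used", ylab = "testing error")
```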

c). [10 pts] Tune the eta and max_depth parameters to see if you can improve the performance.
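A small grid sketch with illustrative values; in practice a validation set or cross-validation (e.g., xgb.cv) is preferable to selecting on the test error directly:

```r
grid <- expand.grid(eta = c(0.1, 0.3, 0.5), max_depth = c(2, 4, 6))
grid$test_error <- NA

for (i in seq_len(nrow(grid))) {
  fit_i <- xgboost(data = train_x, label = train_y, nrounds = 50,
                   objective = "multi:softmax", num_class = 10,
                   eta = grid$eta[i], max_depth = grid$max_depth[i],
                   verbose = 0)
  grid$test_error[i] <- mean(predict(fit_i, test_x) != test_y)
}

grid[which.min(grid$test_error), ]
```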