Please remove this section when submitting your homework.
Students are encouraged to work together on homework and/or utilize advanced AI tools. However, there are two basic rules:
- Final submissions must be uploaded to Gradescope. No email or hard copy will be accepted. Please refer to the course website for the late submission policy and grading rubrics.
- Name your file HWx_yourNetID.pdf. For example, HW01_rqzhu.pdf. Please note that this must be a .pdf file; .html files will not be accepted because they are often not readable on Gradescope.
- Make all of your R code chunks visible for grading.
- Make sure the version of your R is \(\geq 4.0.0\). This will ensure your random seed generation is the same as everyone else.
- If you use this .Rmd file as a template, be sure to remove this instruction section.

The goal of this exercise is to help you learn to read package
documentation and correctly use a faster implementation of random
forests. The original randomForest package is relatively
slow. The ranger package provides a faster alternative, but
some parameter names differ.
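For reference, here is a minimal sketch of how the lecture's randomForest arguments map onto ranger, using a built-in dataset as a stand-in for ours: nodesize and ntree become min.node.size and num.trees, while mtry keeps its name.

  library(ranger)

  # randomForest(Species ~ ., iris, mtry = 2, nodesize = 5, ntree = 500)
  # corresponds in ranger to:
  fit = ranger(Species ~ ., data = iris,
               mtry = 2, min.node.size = 5, num.trees = 500)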
Carefully read the ranger
documentation to identify the parameter names corresponding to
mtry, nodesize and other specifications of the
model (used in our lectures). Then complete the following tasks:
a). [5 pts] Load the Cleveland Heart Disease dataset
(processed_cleveland.csv). You can download it from our
course website. Recode the outcome num so that
num > 0 is labeled 1, and 0
otherwise.
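One possible sketch, assuming the file sits in your working directory (the file location is an assumption):

  # read the raw data
  heart = read.csv("processed_cleveland.csv")

  # recode the outcome: 1 if num > 0, 0 otherwise
  heart$num = as.factor(ifelse(heart$num > 0, 1, 0))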
b). [5 pts] Remove any observations where ca or
thal equals "?", and convert these variables
to factors.
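Continuing the sketch above:

  # drop observations with the missing code "?" and convert to factors
  heart = heart[heart$ca != "?" & heart$thal != "?", ]
  heart$ca = as.factor(heart$ca)
  heart$thal = as.factor(heart$thal)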
c). [10 pts] Fit random forests using the ranger() function with:

Report the training error by predicting on the training data with the fitted model.
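A minimal sketch of the fit and the training error; the tuning values here are placeholders, not the assignment's required settings:

  library(ranger)

  # fit the forest (placeholder settings)
  rf.fit = ranger(num ~ ., data = heart, num.trees = 500)

  # training error: predict the training data with the fitted model
  train.pred = predict(rf.fit, data = heart)$predictions
  mean(train.pred != heart$num)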
d). [10 pts] Report the out-of-bag (OOB) prediction error on the training data and explain how it differs from the training error reported in part (c). Which error is better, and which one should we rely on? Why?
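A sketch of how ranger exposes the OOB results, continuing from the fitted object above:

  # OOB predictions and the OOB error are stored in the fitted object
  oob.pred = rf.fit$predictions
  mean(oob.pred != heart$num)
  rf.fit$prediction.error   # OOB misclassification error reported by ranger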
e). [15 pts] Perform a grid search to tune the mtry and
min.node.size parameters. Use the following grid:
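A sketch of the search loop, with OOB error as one common tuning criterion; the grid values below are placeholders, so substitute the grid given in the assignment:

  # hypothetical grid -- replace with the assignment's values
  mtry.grid = c(2, 4, 6)
  nodesize.grid = c(1, 5, 10)

  oob.err = matrix(NA, length(mtry.grid), length(nodesize.grid))
  for (i in seq_along(mtry.grid)) {
    for (j in seq_along(nodesize.grid)) {
      fit = ranger(num ~ ., data = heart, num.trees = 500,
                   mtry = mtry.grid[i], min.node.size = nodesize.grid[j])
      oob.err[i, j] = fit$prediction.error   # OOB error as the criterion
    }
  }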
f). [5 pts] The ranger package handles categorical variables differently than the randomForest package by default. This is mainly due to a mechanism it implements via the respect.unordered.factors parameter. Carefully read the documentation for this parameter and explain what it does. Discuss the pros and cons of using respect.unordered.factors = "partition" when fitting random forests with categorical variables.
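For experimenting with this setting, the call looks like:

  # treat unordered factors with the partitioning approach
  fit.part = ranger(num ~ ., data = heart, num.trees = 500,
                    respect.unordered.factors = "partition")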
In this question, we will conduct a small simulation study to analyze
the effect of the nodesize parameter in random forests.
This is a regression problem, and you should still use the
ranger package to complete this question. Do the
following:
a). [15 pts] Set up a simulation study to evaluate the prediction performance of random forests. Refer to our previous homework for simulation ideas. Use the following settings:

- Repeat the simulation nsim = 200 times.
- Use mtry = 1 and number of trees = 500 for all simulations.
- Consider the grid of nodesize values c(10, 20, 30, 40, 50, 60).
- Generate the data as follows:

  library(MASS)
  n = 300
  set.seed(546)
  
  # fix this set of testing data for all simulations
  testx = mvrnorm(n, c(0, 0), matrix(c(1, 0.5, 0.5, 1), 2, 2))  
  
  # generate training data within each simulation
  trainx = mvrnorm(n, c(0, 0), matrix(c(1, 0.5, 0.5, 1), 2, 2))
  trainy = rnorm(n, mean = trainx[,1] + trainx[,2])
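A minimal sketch of the simulation loop, continuing the chunk above; since the data-generating model gives the true mean at the testing points, Bias\(^2\) and Variance can be computed directly from the stored predictions:

  library(ranger)

  nsim = 200
  nodesize.grid = c(10, 20, 30, 40, 50, 60)
  truemean = testx[, 1] + testx[, 2]   # true regression function at testx

  # store test predictions: n points x nsim replications x grid values
  pred = array(NA, dim = c(n, nsim, length(nodesize.grid)))

  for (k in seq_along(nodesize.grid)) {
    for (s in 1:nsim) {
      trainx = mvrnorm(n, c(0, 0), matrix(c(1, 0.5, 0.5, 1), 2, 2))
      trainy = rnorm(n, mean = trainx[, 1] + trainx[, 2])
      fit = ranger(y ~ ., data = data.frame(x = trainx, y = trainy),
                   mtry = 1, num.trees = 500,
                   min.node.size = nodesize.grid[k])
      pred[, s, k] = predict(fit, data = data.frame(x = testx))$predictions
    }
  }

  # averaged Bias^2 and Variance over the testing points
  bias2 = apply(pred, 3, function(p) mean((rowMeans(p) - truemean)^2))
  vars  = apply(pred, 3, function(p) mean(apply(p, 1, var)))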
b). [5 pts] Plot the averaged Bias\(^2\) and Variance against the grid of
nodesize values. Provide a brief discussion of the results
in terms of how nodesize affects the Bias-Variance
trade-off in random forests.
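Continuing the sketch above, one basic way to draw the plot:

  # Bias^2 and Variance against nodesize
  matplot(nodesize.grid, cbind(bias2, vars), type = "b", pch = 1:2,
          col = c("darkorange", "deepskyblue"),
          xlab = "nodesize", ylab = "value")
  legend("topright", c("Bias^2", "Variance"), pch = 1:2,
         col = c("darkorange", "deepskyblue"))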
xgboost for MNIST

In this question, we will use the xgboost package to perform multi-class classification on the MNIST dataset. The data can be obtained from HW5. You should use the first 1000 observations as the training data and the rest as the testing data, including all digits (0–9); no PCA is required.
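A sketch of the split, assuming the HW5 data are loaded into a data frame mnist whose first column is the digit label (the object name and layout are assumptions):

  # first 1000 rows for training, the rest for testing
  trainx = as.matrix(mnist[1:1000, -1])
  trainy = mnist[1:1000, 1]
  testx  = as.matrix(mnist[-(1:1000), -1])
  testy  = mnist[-(1:1000), 1]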
a). [10 pts] Use the xgboost function to fit the MNIST
training data. Specify the following and report the testing error rate
and the confusion matrix (a code sketch follows part (b) below):

- objective = "multi:softmax" to handle multi-class classification.
- num_class = 10 for the number of classes.
- booster = "gbtree".
- eta = 0.5 (learning rate).
- max_depth = 2 (maximum tree depth).
- nrounds = 50 (number of boosting iterations).

b). [15 pts] The model fits 50 trees sequentially. However,
you can produce your prediction using just a limited number of trees.
This can be controlled using the iterationrange argument in
the predict() function. Plot your prediction error
vs. number of trees. Comment on your results. The way to specify
iterationrange can be found on page 17 of https://cran.r-project.org/web/packages/xgboost/xgboost.pdf
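A sketch covering parts (a) and (b), using the split above; this assumes the classic xgboost() interface with data/label arguments (newer package versions may differ), and the exact indexing convention of iterationrange should be checked against the manual page cited above:

  library(xgboost)

  # part (a): fit with the specified settings
  xgb.fit = xgboost(data = trainx, label = trainy,
                    objective = "multi:softmax", num_class = 10,
                    booster = "gbtree", eta = 0.5, max_depth = 2,
                    nrounds = 50, verbose = 0)

  # testing error and confusion matrix using all 50 trees
  pred = predict(xgb.fit, testx)
  mean(pred != testy)
  table(predicted = pred, actual = testy)

  # part (b): error using only the first k trees
  # (verify the iterationrange convention in the manual)
  err = sapply(1:50, function(k) {
    p = predict(xgb.fit, testx, iterationrange = c(1, k))
    mean(p != testy)
  })
  plot(1:50, err, type = "l",
       xlab = "number of trees", ylab = "testing error")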
c). [10 pts] Tune the eta and max_depth parameters to see if you can improve the performance (see the sketch after this list):

- Try three values of eta and three values of max_depth.
- For each combination of eta and max_depth, obtain the best number of iterations (via iterationrange) for predicting the testing data. This is not cross-validation; it is just predicting the testing data.
- Report the best combination of eta and max_depth and the corresponding testing error.
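A sketch of the tuning loop, under the same interface assumptions as the previous sketch; the grid values are placeholders of my own choosing:

  # hypothetical grids -- pick your own three values of each
  eta.grid = c(0.1, 0.3, 0.5)
  depth.grid = c(2, 4, 6)

  results = expand.grid(eta = eta.grid, max_depth = depth.grid)
  results$best.err = NA

  for (i in 1:nrow(results)) {
    fit = xgboost(data = trainx, label = trainy,
                  objective = "multi:softmax", num_class = 10,
                  booster = "gbtree", eta = results$eta[i],
                  max_depth = results$max_depth[i],
                  nrounds = 50, verbose = 0)
    # best number of trees judged directly on the testing data
    err = sapply(1:50, function(k) {
      p = predict(fit, testx, iterationrange = c(1, k))
      mean(p != testy)
    })
    results$best.err[i] = min(err)
  }

  # best combination and its testing error
  results[which.min(results$best.err), ]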