Please remove this section when submitting your homework.
Students are encouraged to work together on homework and/or utilize advanced AI tools. However, sharing, copying, or providing any part of a homework solution or code to others is an infraction of the University’s rules on Academic Integrity. Any violation will be punished as severely as possible. Final submissions must be uploaded to Gradescope. No email or hard copy will be accepted. For late submission policy and grading rubrics, please refer to the course website.
Name your file as `HWx_yourNetID.pdf`. For example, `HW01_rqzhu.pdf`. Please note that this must be a `.pdf` file; the `.html` format cannot be accepted. Make all of your R code chunks visible for grading. If you use the `.Rmd` file as a template, be sure to remove this instruction section. Make sure that your version of R is \(\geq 4.0.0\). This will ensure your random seed generation is the same as everyone else's. Please note that updating the R version may require you to reinstall all of your packages.

## `randomForestSRC` for MNIST

In our lecture, we mainly used the `RLT`
package and the `randomForest` package. However, there are many other packages for random forests, for example, `ranger`, `grf`, `randomForestSRC`, etc. Let's consider using the `randomForestSRC` package for this task, for a multi-class classification problem. First, process the MNIST data with the same PCA approach as in the previous HW: take the `mnist` data with 2000 observations, keep digits 1, 6, and 7, and perform PCA on the pixels.

Then, complete the following tasks using the `randomForestSRC` package:
- Use the `rfsrc` function to fit a random forest model with the default settings. Report the OOB error rate.
- Use a grid of `mtry` and `nodesize` values to tune the model. Use the `rfsrc` function to fit the model with the grid on the training data and select the best tuning. Report the OOB error rate for each model and report the results (anything that you think is necessary) of the best model. Make sure to also calculate and present the variable importance. Comment on your results.

Be careful that different packages use different notations for their parameters. It is strongly recommended to check the `rfsrc` documentation to see what parameters you need to specify and what results you should extract (!!!) from the fitted model so that you will have the correct assessment of the result.
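A minimal sketch of this workflow is given below. It assumes the `mnist` object from the previous HW is a data frame whose first column, `Digit`, holds the labels and whose remaining columns are the pixel values, and it keeps the first 50 principal components; these names and choices are assumptions you should adapt to your own setup.

```r
library(randomForestSRC)

set.seed(432)
# first 2000 observations, then keep digits 1, 6, and 7 (layout assumed)
sub <- mnist[1:2000, ]
sub <- sub[sub$Digit %in% c(1, 6, 7), ]

# PCA on the pixel columns; keep, e.g., the first 50 components
pc  <- prcomp(as.matrix(sub[, -1]))
dat <- data.frame(Digit = as.factor(sub$Digit), pc$x[, 1:50])

# default fit; the OOB error path is stored in err.rate
fit <- rfsrc(Digit ~ ., data = dat)
fit$err.rate[fit$ntree, 1]   # overall OOB error at the final tree

# a small tuning grid over mtry and nodesize (values are illustrative)
grid <- expand.grid(mtry = c(5, 10, 20), nodesize = c(1, 5, 10))
oob  <- sapply(seq_len(nrow(grid)), function(i) {
  f <- rfsrc(Digit ~ ., data = dat,
             mtry = grid$mtry[i], nodesize = grid$nodesize[i])
  f$err.rate[f$ntree, 1]
})
cbind(grid, oob)

# refit the best model and request variable importance
best <- grid[which.min(oob), ]
fit.best <- rfsrc(Digit ~ ., data = dat, mtry = best$mtry,
                  nodesize = best$nodesize, importance = TRUE)
fit.best$importance
```

Note that `err.rate` and `importance` are the slots documented in `?rfsrc`; verify against the documentation which columns correspond to the overall error versus the per-class errors.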
## `xgboost` for MNIST

- [20 Points] Use the `xgboost` package to fit the MNIST multi-class classification problem. For this question, you should use all the data in your 2000-sample MNIST dataset; use all digits, and no PCA is needed. You should specify the following:
  - `multi:softmax` as the objective function so that it can handle multi-class classification.
  - `num_class = 10` to specify the number of classes.
  - `eta = 0.5`
  - `max_depth = 2`
  - `nrounds = 50`

  Report the testing error rate and the confusion matrix.
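With these settings, the fit might look like the sketch below. The `mnist` data frame layout (first column `Digit` with labels 0-9, remaining columns pixels) and the 1000/1000 train-test split are assumptions carried over from the previous HW; adapt them to your own split.

```r
library(xgboost)

train <- mnist[1:1000, ]
test  <- mnist[1001:2000, ]

fit <- xgboost(data = as.matrix(train[, -1]), label = train$Digit,
               objective = "multi:softmax", num_class = 10,
               eta = 0.5, max_depth = 2, nrounds = 50, verbose = 0)

pred <- predict(fit, as.matrix(test[, -1]))
mean(pred != test$Digit)                      # testing error rate
table(Predicted = pred, True = test$Digit)    # confusion matrix
```

With `multi:softmax`, `predict()` returns the predicted class labels directly, so the error rate and confusion matrix can be computed without any conversion.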
- [20 Points] The model fits 50 rounds (trees) sequentially. However, you can produce your prediction using just a limited number of trees. This can be controlled using the `iterationrange` argument in the `predict()` function. Plot your prediction error vs. the number of trees. Comment on your results.
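A sketch of this part is below, assuming you already have a fitted booster `fit` and a test set `test` from the previous part (both names are placeholders). It assumes the convention documented for `predict()` in recent versions of `xgboost`, where `iterationrange = c(1, k + 1)` uses the first `k` trees; check your installed version's documentation.

```r
library(xgboost)

# testing error using only the first k trees, for k = 1, ..., 50
err <- sapply(1:50, function(k) {
  p <- predict(fit, as.matrix(test[, -1]), iterationrange = c(1, k + 1))
  mean(p != test$Digit)
})

plot(1:50, err, type = "l",
     xlab = "Number of trees", ylab = "Testing error")
```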
- [20 Points] Tune the `eta` and `max_depth` parameters to see if you can improve the performance.
  - Consider three values of `eta` and three values of `max_depth`.
  - For each combination of `eta` and `max_depth`, obtain the best number of trees, via `iterationrange`, for predicting the testing data. This is not a cross-validation; it is just predicting the testing data. The way to specify `iterationrange` can be found on page 17 of https://cran.r-project.org/web/packages/xgboost/xgboost.pdf
  - Report the best combination of `eta` and `max_depth` and the corresponding testing error.
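One possible shape of this tuning loop is sketched below. The specific grid values, the `train`/`test` objects, and the data layout are illustrative assumptions; reuse the split from your earlier parts.

```r
library(xgboost)

# illustrative grid: three values each of eta and max_depth
params <- expand.grid(eta = c(0.1, 0.3, 0.5), max_depth = c(2, 4, 6))

results <- lapply(seq_len(nrow(params)), function(i) {
  f <- xgboost(data = as.matrix(train[, -1]), label = train$Digit,
               objective = "multi:softmax", num_class = 10,
               eta = params$eta[i], max_depth = params$max_depth[i],
               nrounds = 50, verbose = 0)
  # testing error for each possible number of trees
  err <- sapply(1:50, function(k) {
    p <- predict(f, as.matrix(test[, -1]), iterationrange = c(1, k + 1))
    mean(p != test$Digit)
  })
  c(best.ntrees = which.min(err), best.err = min(err))
})

cbind(params, do.call(rbind, results))
```

The row of the resulting table with the smallest `best.err` gives the best `eta` and `max_depth` combination and its testing error.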