Students are encouraged to work together on homework. However, sharing, copying, or providing any part of a homework solution or code is an infraction of the University’s rules on Academic Integrity. Any violation will be punished as severely as possible. Final submissions must be uploaded to compass2g. No email or hardcopy will be accepted. For late submission policy and grading rubrics, please refer to the course website.
What is expected for the submission to Gradescope
Name your file HWx_yourNetID.pdf; for example, HW01_rqzhu.pdf. Please note that this must be a .pdf file generated by a .Rmd file; .html format cannot be accepted. Also note that your homework file should be a PDF report instead of a messy collection of R code. This report should include:
- Your name and NetID at the top (replace Ruoqing Zhu (rqzhu) by your name and NetID if you are using this template).
- All R code chunks visible for grading.
- Answers backed up by the R code chunks that support them, e.g., "Answer: I fit SVM with the following choice of tuning parameters ..."
Requirements regarding the .Rmd file: you do not need to submit your .Rmd files; however, your PDF file should be rendered directly from one.

For this HW, we mainly try to understand the KNN method in both classification and regression settings and use it on several real data examples. Tuning the model will help us understand the bias-variance trade-off. A slightly more challenging task is to code a KNN method yourself; for that question, you cannot use any additional package to assist the calculation.
There is an important package, ElemStatLearn, which is the package associated with the ESL textbook for this course. Unfortunately, the package is currently discontinued on CRAN. You can install an earlier version of this package by using
require(devtools)
install_version("ElemStatLearn", version = "2015.6.26.2", repos = "http://cran.us.r-project.org")
And of course, you will have to install the devtools package if you don't already have it.
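If devtools is missing, it can be installed from CRAN first; a minimal sketch (the requireNamespace guard is just one common idiom for this):

```r
# Install devtools from CRAN only if it is not already available
if (!requireNamespace("devtools", quietly = TRUE)) {
  install.packages("devtools")
}
```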
Load the Pima Indians Diabetes Database (PimaIndiansDiabetes) from the mlbench package. If you don't already have the package installed, run the commented install line in the code below. The code also randomly splits the data into training and testing sets; you should preserve this split throughout the analysis.
# install.packages("mlbench") # run this line if you don't have the package
library(mlbench)
data(PimaIndiansDiabetes)
set.seed(2)
trainid = sample(1:nrow(PimaIndiansDiabetes), nrow(PimaIndiansDiabetes)/2)
Diab.train = PimaIndiansDiabetes[trainid, ]
Diab.test = PimaIndiansDiabetes[-trainid, ]
Use a grid of \(k\) values (every integer) from 1 to 20.
[10 pts] Fit a KNN model using Diab.train and calculate both training and testing errors. For the testing error, use Diab.test. Plot the two errors against the corresponding \(k\) values. Make sure that you differentiate them using different colors/shapes and add proper legends.
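As an illustration only (not the required solution), the loop below sketches one way to compute both error curves. It assumes the knn() function from the class package (any KNN implementation works), the variable names are made up for this sketch, and the split code from the handout is repeated so the chunk is self-contained:

```r
library(class)    # provides knn(); an assumption -- any KNN implementation can be used
library(mlbench)

# repeat the handout's split so this chunk is self-contained
data(PimaIndiansDiabetes)
set.seed(2)
trainid = sample(1:nrow(PimaIndiansDiabetes), nrow(PimaIndiansDiabetes)/2)
Diab.train = PimaIndiansDiabetes[trainid, ]
Diab.test  = PimaIndiansDiabetes[-trainid, ]

k.grid = 1:20
train.err = test.err = numeric(length(k.grid))

for (k in k.grid) {
  # predict on the training data itself and on the holdout data
  # (column 9 is the outcome, diabetes)
  pred.train = knn(Diab.train[, -9], Diab.train[, -9], Diab.train$diabetes, k = k)
  pred.test  = knn(Diab.train[, -9], Diab.test[, -9],  Diab.train$diabetes, k = k)
  train.err[k] = mean(pred.train != Diab.train$diabetes)
  test.err[k]  = mean(pred.test  != Diab.test$diabetes)
}

# plot both error curves with distinct colors/shapes and a legend
plot(k.grid, train.err, type = "b", col = "blue", pch = 19,
     ylim = range(c(train.err, test.err)), xlab = "k", ylab = "error")
lines(k.grid, test.err, type = "b", col = "red", pch = 17)
legend("bottomright", legend = c("training", "testing"),
       col = c("blue", "red"), pch = c(19, 17))
```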
[15 pts] Does the plot match (approximately) our intuition of the bias-variance trade-off in terms of having a U-shaped error? What is the optimal \(k\) value based on this result? For the optimal \(k\), what are the corresponding degrees of freedom and the error?
[15 pts] Suppose we do not have access to the Diab.test data. Thus, we need to further split the training data into training and validation sets to tune k. For this question, use the caret package to complete the tuning. You are required to:

- Use the train() function to fit the model.
- Specify the cross-validation scheme with the trainControl() function; we need to use three-fold cross-validation.
- Specify the candidate values with expand.grid(k = c(1:20)).
- Report the selected k and compare it with the optimal k in b).

For details, read either the example from SMLR or the documentation here to learn how to use the trainControl() and train() functions. Some examples can also be found here.
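A minimal sketch of how these caret pieces fit together (the formula interface and the seed placement are illustrative assumptions; the split code is repeated from the handout so the chunk is self-contained):

```r
library(caret)
library(mlbench)

# repeat the handout's split so this chunk is self-contained
data(PimaIndiansDiabetes)
set.seed(2)
trainid = sample(1:nrow(PimaIndiansDiabetes), nrow(PimaIndiansDiabetes)/2)
Diab.train = PimaIndiansDiabetes[trainid, ]

set.seed(2)  # seed before train() so the CV folds are reproducible
knn.fit = train(diabetes ~ ., data = Diab.train,
                method = "knn",
                trControl = trainControl(method = "cv", number = 3),
                tuneGrid = expand.grid(k = c(1:20)))
knn.fit$bestTune   # the k selected by three-fold cross-validation
```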
Fit KNN with k = 5. Use the first 500 observations as the training data and the rest as testing data. Predict the \(Y\) values using your KNN function with k = 5, and report the mean squared error \[\frac{1}{N}\sum_i (y_i - \widehat y_i)^2.\] This question also helps you validate your own function in b); a) and b) are expected to have similar (possibly not identical) results.

For this question, you cannot use a KNN function from any existing R package. Write your own function myknn(xtrain, ytrain, xtest, k) that fits a KNN model and predicts multiple target points xtest. The function should return a variable ytest.
Here xtrain is the training dataset covariate value, ytrain is the training data outcome, and k is the number of nearest neighbors; ytest is the prediction on xtest.
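To make the expected interface concrete, here is one minimal sketch of such a function for the regression setting, using Euclidean distance and averaging the k nearest outcomes (ties are broken by the order returned by order()). This is a sketch under those assumptions, not the definitive implementation:

```r
# myknn: base-R KNN regression, no additional packages
#   xtrain - matrix of training covariates (one row per observation)
#   ytrain - vector of training outcomes
#   xtest  - matrix of target points to predict at
#   k      - number of nearest neighbors
myknn <- function(xtrain, ytrain, xtest, k) {
  xtrain = as.matrix(xtrain)
  xtest  = as.matrix(xtest)
  ytest  = numeric(nrow(xtest))
  for (i in 1:nrow(xtest)) {
    # squared Euclidean distance from the i-th target point to every training point
    d = rowSums(sweep(xtrain, 2, xtest[i, ])^2)
    nb = order(d)[1:k]           # indices of the k nearest neighbors
    ytest[i] = mean(ytrain[nb])  # average their outcomes
  }
  ytest
}
```

For example, with one covariate x = 1, ..., 10 and y = x, predicting at x = 5.1 with k = 3 averages the outcomes of the three nearest points (5, 6, and 4).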
Let’s consider a high-dimensional setting. Keep the data-generating model the same as in question 2. In addition to the outcomes and covariates from question 2, we will also generate 95 more noisy variables to make p = 100. In this question, you can use a KNN function from any existing package.
We consider two different settings to generate that additional set of 95 covariates. Make sure to set random seeds for reproducibility.
Fit KNN in both settings (with the full set of 100 covariates) and select the best \(k\) value. Answer the following questions: