Instructions

Students are encouraged to work together on homework. However, sharing, copying, or providing any part of a homework solution or code is an infraction of the University’s rules on Academic Integrity. Any violation will be punished as severely as possible. Final submissions must be uploaded to compass2g. No email or hardcopy will be accepted. For late submission policy and grading rubrics, please refer to the course website.

About HW2

In this HW, we mainly aim to understand the KNN method in both the classification and regression settings and apply it to several real-data examples. Tuning the model will help us understand the bias-variance trade-off. A slightly more challenging task is to code the KNN method yourself; for that question, you cannot use any additional package to assist the calculation.

There is an important package, ElemStatLearn, which is the package associated with the ESL textbook for this course. Unfortunately, the package has been archived on CRAN. You can install an earlier version of it using

    require(devtools)
    # install the archived 2015 release of ElemStatLearn from the CRAN archive
    install_version("ElemStatLearn", version = "2015.6.26.2", repos = "http://cran.us.r-project.org")

Of course, you will have to install the devtools package first if you don’t already have it.

Question 1 [40 Points] KNN Classification (Diabetes)

Load the Pima Indians Diabetes Database (PimaIndiansDiabetes) from the mlbench package. If you don’t already have the package installed, uncomment the first line of the following code. The code also randomly splits the data into training and testing sets; you should preserve this split throughout the analysis.

    # install.packages("mlbench") # run this line if you don't have the package
    library(mlbench)
    data(PimaIndiansDiabetes)
    
    set.seed(2)
    trainid = sample(1:nrow(PimaIndiansDiabetes), nrow(PimaIndiansDiabetes)/2)
    Diab.train = PimaIndiansDiabetes[trainid, ]
    Diab.test = PimaIndiansDiabetes[-trainid, ]

Use a grid of \(k\) values consisting of every integer from 1 to 20.

  1. [10 pts] Fit a KNN model using Diab.train and calculate both the training and testing errors, using Diab.test for the testing error. Plot the two errors against the corresponding \(k\) values. Make sure to differentiate them using different colors/shapes and to add a proper legend. (A minimal starting sketch is given after this question.)

  2. [15 pts] Does the plot (approximately) match our intuition about the bias-variance trade-off, i.e., a U-shaped testing error? What is the optimal \(k\) value based on this result? For that optimal \(k\), what are the corresponding degrees of freedom (recall that a KNN fit has approximately \(n/k\) degrees of freedom, where \(n\) is the training sample size) and the associated error?

  3. [15 pts] Suppose we do not have access to the Diab.test data. We then need to further split the training data into training and validation sets to tune \(k\). For this question, use the caret package to complete the tuning. You are required to

    • Train the KNN model with cross-validation using the train() function.
      • Specify the type of cross-validation using the trainControl() function; use three-fold cross-validation.
      • Specify the grid of tuning parameters; this can be done using expand.grid(k = 1:20).
    • Report the best tuning parameter with its error, and compare it with your \(k\) from part 2.

For details on how to use the trainControl() and train() functions, read the example from SMLR or the caret package documentation. A minimal sketch for this part is also given below.
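
To get started on part 1, here is a minimal sketch using the knn() function from the class package (our assumption; any standard KNN implementation is acceptable). It relies on the outcome diabetes being the last (ninth) column of the data, loops over the grid of \(k\) values, and records the training and testing misclassification errors:

    library(class)  # provides knn(); an assumption, any KNN implementation works

    k.grid = 1:20
    train.err = test.err = rep(NA, length(k.grid))

    for (i in seq_along(k.grid)) {
      # predict the training data itself (training error)
      pred.train = knn(train = Diab.train[, -9], test = Diab.train[, -9],
                       cl = Diab.train$diabetes, k = k.grid[i])
      # predict the held-out testing data (testing error)
      pred.test = knn(train = Diab.train[, -9], test = Diab.test[, -9],
                      cl = Diab.train$diabetes, k = k.grid[i])
      train.err[i] = mean(pred.train != Diab.train$diabetes)
      test.err[i] = mean(pred.test != Diab.test$diabetes)
    }

    # plot both error curves with distinct colors/shapes and a legend
    plot(k.grid, train.err, type = "b", pch = 19, col = "blue",
         ylim = range(c(train.err, test.err)), xlab = "k", ylab = "error")
    lines(k.grid, test.err, type = "b", pch = 17, col = "red")
    legend("bottomright", legend = c("training", "testing"),
           col = c("blue", "red"), pch = c(19, 17))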
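
For part 3, a minimal caret sketch under the same setup (three-fold cross-validation on Diab.train only; the seed used for fold assignment is our assumption, not something the assignment specifies):

    library(caret)

    set.seed(2)  # seed for the fold assignment (an assumption)
    control = trainControl(method = "cv", number = 3)
    knn.cvfit = train(diabetes ~ ., data = Diab.train, method = "knn",
                      trControl = control,
                      tuneGrid = expand.grid(k = 1:20))
    knn.cvfit$bestTune  # the selected k
    knn.cvfit$results   # cross-validated accuracy per k; error = 1 - accuracy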

Question 2 [40 Points] Write your own KNN for regression

  1. [10 pts] Generate \(n = 1000\) independent observations of \(p = 5\) independent standard normal covariates \(X_1, X_2, X_3, X_4, X_5\). Then generate \(Y\) from the regression model \[ Y = X_1 + 0.5 \times X_2 - X_3 + \epsilon,\] with i.i.d. standard normal error \(\epsilon\). Make sure to set the random seed to 1 for reproducibility.
  2. [30 pts] For this question, you cannot use (load) any additional R package. Write your own function myknn(xtrain, ytrain, xtest, k) that fits a KNN model and predicts at multiple target points xtest. The function should return a vector ytest of predictions. (A minimal sketch covering both parts is given after this question.)
    • Here, xtrain is the matrix of training covariates, ytrain is the vector of training outcomes, and k is the number of nearest neighbors; ytest is the vector of predictions at the points in xtest.
    • Use the Euclidean distance to measure the closeness between two points.
    • Test your code by reporting the mean squared error on the testing data.
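
Here is a minimal base-R sketch covering both parts: it generates the data with seed 1, implements myknn() with Euclidean distances, and reports the mean squared error. The 500/500 train/test split and the choice k = 5 are our assumptions for illustration, since the assignment does not fix them:

    set.seed(1)
    n = 1000; p = 5
    X = matrix(rnorm(n * p), n, p)
    Y = X[, 1] + 0.5 * X[, 2] - X[, 3] + rnorm(n)

    # an assumed split: first 500 observations for training, the rest for testing
    xtrain = X[1:500, ];    ytrain = Y[1:500]
    xtest  = X[501:1000, ]; ytest.true = Y[501:1000]

    myknn <- function(xtrain, ytrain, xtest, k) {
      ytest = rep(NA, nrow(xtest))
      for (i in 1:nrow(xtest)) {
        # Euclidean distance from the i-th target point to every training point
        d = sqrt(rowSums(sweep(xtrain, 2, xtest[i, ])^2))
        # prediction: average the outcomes of the k nearest neighbors
        ytest[i] = mean(ytrain[order(d)[1:k]])
      }
      return(ytest)
    }

    pred = myknn(xtrain, ytrain, xtest, k = 5)  # k = 5 is arbitrary here
    mean((pred - ytest.true)^2)                 # mean squared error on testing data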

Question 3 [30 Points] Curse of Dimensionality

Let’s consider a high-dimensional setting. Keep the data-generating model the same as in Question 2. In addition to the outcome and covariates from Question 2, we will generate 95 additional noise variables so that \(p = 100\). In this question, you may use a KNN function from any existing package.

We consider two different settings for generating the additional set of 95 covariates. Make sure to set random seeds for reproducibility.

Fit KNN in both settings (with all 100 covariates) and select the best \(k\) value; a tuning sketch is given after the questions below. Answer the following questions:

  1. [10 pts] For the first setting, what is the best \(k\) and the best mean squared error for prediction?
  2. [10 pts] For the second setting, what is the best \(k\) and the best mean squared error for prediction?
  3. [10 pts] In which setting does KNN perform better? Why?
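
As a purely illustrative sketch: the generation of the 95 extra covariates below is only a placeholder (replace it with whatever each setting specifies), and tuning by three-fold cross-validation via caret is our assumption rather than a requirement. Reusing X and Y from the Question 2 sketch:

    library(caret)

    set.seed(1)
    # placeholder generation of the 95 extra covariates; substitute the
    # scheme required by each setting
    X.noise = matrix(rnorm(1000 * 95), 1000, 95)
    dat = data.frame(Y = Y, cbind(X, X.noise))  # p = 100 covariates in total

    set.seed(1)
    fit = train(Y ~ ., data = dat, method = "knn",
                trControl = trainControl(method = "cv", number = 3),
                tuneGrid = expand.grid(k = 1:20))
    fit$bestTune             # the selected k
    min(fit$results$RMSE)^2  # best mean squared error (squared RMSE)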