Instructions

Students are encouraged to work together on homework. However, sharing, copying, or providing any part of a homework solution or code is an infraction of the University’s rules on Academic Integrity. Any violation will be punished as severely as possible. Final submissions must be uploaded to compass2g. No email or hardcopy will be accepted. For late submission policy and grading rubrics, please refer to the course website.

About HW2

In this HW, we mainly aim to understand the KNN method in both the classification and regression settings and apply it to several real-data examples. Tuning the model will help us understand the bias-variance trade-off. A slightly more challenging task is to code the KNN method yourself; for that question, you cannot use any additional package to assist the calculation.

There is an important package, ElemStatLearn, which is the package associated with the ESL textbook for this course. Unfortunately, the package has been archived on CRAN. You can install an earlier version of it using

    require(devtools)
    # install the archived 2015 release of ElemStatLearn from the CRAN archive
    install_version("ElemStatLearn", version = "2015.6.26.2", repos = "http://cran.us.r-project.org")

Of course, you will have to install the devtools package first if you don’t already have it.

Question 1 [40 Points] KNN Classification (Diabetes)

Load the Pima Indians Diabetes Database (PimaIndiansDiabetes) from the mlbench package. If you don’t already have the package installed, uncomment the first line of the following code. The code also randomly splits the data into training and testing sets; you should preserve this split throughout the analysis.

    # install.packages("mlbench") # run this line if you don't have the package
    library(mlbench)
    data(PimaIndiansDiabetes)
    
    set.seed(2)
    trainid = sample(1:nrow(PimaIndiansDiabetes), nrow(PimaIndiansDiabetes)/2)
    Diab.train = PimaIndiansDiabetes[trainid, ]
    Diab.test = PimaIndiansDiabetes[-trainid, ]

Use a grid of \(k\) values consisting of every integer from 1 to 20.

  1. [10 pts] Fit a KNN model using Diab.train and calculate both the training and testing errors, using Diab.test for the testing error. Plot the two errors against the corresponding \(k\) values. Make sure to differentiate them using different colors/shapes and to add a proper legend. (A minimal starting sketch is given after this question.)

  2. [15 pts] Does the plot (approximately) match our intuition about the bias-variance trade-off, i.e., a U-shaped testing error? What is the optimal \(k\) value based on this result? For that optimal \(k\), what are the corresponding degrees of freedom (recall that a KNN fit has approximately \(n/k\) degrees of freedom, where \(n\) is the training sample size) and the associated error?

  3. [15 pts] Suppose we do not have access to the Diab.test data. We then need to further split the training data into training and validation sets to tune \(k\). For this question, use the caret package to complete the tuning. You are required to

    • Train the KNN model with cross-validation using the train() function.
      • Specify the type of cross-validation using the trainControl() function; use three-fold cross-validation.
      • Specify the grid of tuning parameters; this can be done using expand.grid(k = 1:20).
    • Report the best tuning parameter with its error, and compare it with your \(k\) from part 2.

For details on how to use the trainControl() and train() functions, read the example from SMLR or the caret package documentation. A minimal sketch for this part is also given below.
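
To get started on part 1, here is a minimal sketch using the knn() function from the class package (our assumption; any standard KNN implementation is acceptable). It relies on the outcome diabetes being the last (ninth) column of the data, loops over the grid of \(k\) values, and records the training and testing misclassification errors:

    library(class)  # provides knn(); an assumption, any KNN implementation works

    k.grid = 1:20
    train.err = test.err = rep(NA, length(k.grid))

    for (i in seq_along(k.grid)) {
      # predict the training data itself (training error)
      pred.train = knn(train = Diab.train[, -9], test = Diab.train[, -9],
                       cl = Diab.train$diabetes, k = k.grid[i])
      # predict the held-out testing data (testing error)
      pred.test = knn(train = Diab.train[, -9], test = Diab.test[, -9],
                      cl = Diab.train$diabetes, k = k.grid[i])
      train.err[i] = mean(pred.train != Diab.train$diabetes)
      test.err[i] = mean(pred.test != Diab.test$diabetes)
    }

    # plot both error curves with distinct colors/shapes and a legend
    plot(k.grid, train.err, type = "b", pch = 19, col = "blue",
         ylim = range(c(train.err, test.err)), xlab = "k", ylab = "error")
    lines(k.grid, test.err, type = "b", pch = 17, col = "red")
    legend("bottomright", legend = c("training", "testing"),
           col = c("blue", "red"), pch = c(19, 17))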
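
For part 3, a minimal caret sketch under the same setup (three-fold cross-validation on Diab.train only; the seed used for fold assignment is our assumption, not something the assignment specifies):

    library(caret)

    set.seed(2)  # seed for the fold assignment (an assumption)
    control = trainControl(method = "cv", number = 3)
    knn.cvfit = train(diabetes ~ ., data = Diab.train, method = "knn",
                      trControl = control,
                      tuneGrid = expand.grid(k = 1:20))
    knn.cvfit$bestTune  # the selected k
    knn.cvfit$results   # cross-validated accuracy per k; error = 1 - accuracy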

Question 2 [40 Points] Write your own KNN for regression

  1. [10 pts] Generate \(n = 1000\) independent observations of \(p = 5\) independent standard normal covariates \(X_1, X_2, X_3, X_4, X_5\). Then generate \(Y\) from the regression model \[ Y = X_1 + 0.5 \times X_2 - X_3 + \epsilon,\] with i.i.d. standard normal error \(\epsilon\). Make sure to set the random seed to 1 for reproducibility.
  2. [30 pts] For this question, you cannot use (load) any additional R package. Write your own function myknn(xtrain, ytrain, xtest, k) that fits a KNN model and predicts at multiple target points xtest. The function should return a vector ytest of predictions. (A minimal sketch covering both parts is given after this question.)
    • Here, xtrain is the matrix of training covariates, ytrain is the vector of training outcomes, and k is the number of nearest neighbors; ytest is the vector of predictions at the points in xtest.
    • Use the Euclidean distance to measure the closeness between two points.
    • Test your code by reporting the mean squared error on the testing data.
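
Here is a minimal base-R sketch covering both parts: it generates the data with seed 1, implements myknn() with Euclidean distances, and reports the mean squared error. The 500/500 train/test split and the choice k = 5 are our assumptions for illustration, since the assignment does not fix them:

    set.seed(1)
    n = 1000; p = 5
    X = matrix(rnorm(n * p), n, p)
    Y = X[, 1] + 0.5 * X[, 2] - X[, 3] + rnorm(n)

    # an assumed split: first 500 observations for training, the rest for testing
    xtrain = X[1:500, ];    ytrain = Y[1:500]
    xtest  = X[501:1000, ]; ytest.true = Y[501:1000]

    myknn <- function(xtrain, ytrain, xtest, k) {
      ytest = rep(NA, nrow(xtest))
      for (i in 1:nrow(xtest)) {
        # Euclidean distance from the i-th target point to every training point
        d = sqrt(rowSums(sweep(xtrain, 2, xtest[i, ])^2))
        # prediction: average the outcomes of the k nearest neighbors
        ytest[i] = mean(ytrain[order(d)[1:k]])
      }
      return(ytest)
    }

    pred = myknn(xtrain, ytrain, xtest, k = 5)  # k = 5 is arbitrary here
    mean((pred - ytest.true)^2)                 # mean squared error on testing data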

Question 3 [30 Points] Curse of Dimensionality

Let’s consider a high-dimensional setting. Keep the data-generating model the same as in Question 2. In addition to the outcome and covariates from Question 2, we will generate 95 additional noise variables so that \(p = 100\). In this question, you may use a KNN function from any existing package.

We consider two different settings for generating the additional set of 95 covariates. Make sure to set random seeds for reproducibility.

Fit KNN in both settings (with all 100 covariates) and select the best \(k\) value; a tuning sketch is given after the questions below. Answer the following questions:

  1. [10 pts] For the first setting, what is the best \(k\) and the best mean squared error for prediction?
  2. [10 pts] For the second setting, what is the best \(k\) and the best mean squared error for prediction?
  3. [10 pts] In which setting does KNN perform better? Why?
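
As a purely illustrative sketch: the generation of the 95 extra covariates below is only a placeholder (replace it with whatever each setting specifies), and tuning by three-fold cross-validation via caret is our assumption rather than a requirement. Reusing X and Y from the Question 2 sketch:

    library(caret)

    set.seed(1)
    # placeholder generation of the 95 extra covariates; substitute the
    # scheme required by each setting
    X.noise = matrix(rnorm(1000 * 95), 1000, 95)
    dat = data.frame(Y = Y, cbind(X, X.noise))  # p = 100 covariates in total

    set.seed(1)
    fit = train(Y ~ ., data = dat, method = "knn",
                trControl = trainControl(method = "cv", number = 3),
                tuneGrid = expand.grid(k = 1:20))
    fit$bestTune             # the selected k
    min(fit$results$RMSE)^2  # best mean squared error (squared RMSE)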