Instructions

Please remove this section when submitting your homework.

Students are encouraged to work together on homework and/or utilize advanced AI tools. However, there are two basic rules:

Final submissions must be uploaded to Gradescope; no email or hard-copy submissions will be accepted. Please refer to the course website for the late submission policy and grading rubrics.

Question 1: [20 pts] KNN Implementation and Simulation

Questions 1 and 2 serve the purpose of comparing KNN with linear regression models in different settings. But first, let's write a KNN function ourselves and compare it with some existing packages. Here is the data generating model we will consider:

\[ Y_i = 0.5 X_{i1} + \sin(X_{i2}) - 0.3 X_{i3}^2 + \epsilon_i, \quad i = 1, \ldots, n \]

Here, for each observation, we generate \(X_1, X_2, X_3\) i.i.d. from the standard normal distribution, and the \(\epsilon_i\) are i.i.d. \(\mathcal{N}(0, 0.1^2)\). We will use \(n = 400\) observations as the training data and another 1000 observations as the testing data. Follow the steps below:
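As a rough sketch of the setup (not the graded solution), the data can be generated and a basic KNN regression function written as follows. The function name `myknn`, the seed, and the choice `k = 5` are illustrative, not prescribed:

```r
set.seed(1)  # illustrative seed

# Generate data from the model above: three i.i.d. standard normal
# predictors and N(0, 0.1^2) noise
gen_data <- function(n) {
  X <- matrix(rnorm(n * 3), n, 3)
  y <- 0.5 * X[, 1] + sin(X[, 2]) - 0.3 * X[, 3]^2 + rnorm(n, sd = 0.1)
  list(X = X, y = y)
}

train <- gen_data(400)
test  <- gen_data(1000)

# KNN regression: for each testing point, average the outcomes of the
# k nearest training points (Euclidean distance)
myknn <- function(Xtrain, ytrain, Xtest, k) {
  apply(Xtest, 1, function(x) {
    d <- sqrt(colSums((t(Xtrain) - x)^2))
    mean(ytrain[order(d)[1:k]])
  })
}

pred <- myknn(train$X, train$y, test$X, k = 5)
mean((pred - test$y)^2)  # testing MSE
```

A sanity check here is to compare the predictions against an existing implementation such as `FNN::knn.reg` with the same `k`.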

Question 2: [25 pts] Bias and Variance of KNN

Similar to our previous homework, we can use a simulation study to understand the bias and variance of KNN. To keep things simple, we will consider just one testing point at \((0.5, 0.7, 1)\). Keep in mind that, based on our understanding of the bias-variance trade-off, we need to perform repeated simulations to obtain estimates of the outcome at this point, and then calculate the bias and variance across all simulations. Use the same model setting and training data size as in Question 1. You should figure out the simulation procedure based on the derivation we did in class. After completing the simulation, you should be able to plot the bias\(^2\), variance, and total error (bias\(^2\) + variance) versus k = seq(1, 29, 2) in a single plot. Comment on your findings. Do you observe the trade-off?
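One possible shape for this simulation, assuming the same data generating model as Question 1, is sketched below. The number of simulation runs `nsim = 200` is an illustrative choice, and the noiseless value of the regression function at the testing point serves as the truth for the bias calculation:

```r
set.seed(2)  # illustrative seed
x0     <- c(0.5, 0.7, 1)                              # the single testing point
true_y <- 0.5 * x0[1] + sin(x0[2]) - 0.3 * x0[3]^2    # noiseless truth at x0
k_grid <- seq(1, 29, 2)
nsim   <- 200                                         # illustrative choice

pred_mat <- matrix(NA, nsim, length(k_grid))
for (s in 1:nsim) {
  # Fresh training set in each run, same model as Question 1
  X <- matrix(rnorm(400 * 3), 400, 3)
  y <- 0.5 * X[, 1] + sin(X[, 2]) - 0.3 * X[, 3]^2 + rnorm(400, sd = 0.1)
  ord <- order(colSums((t(X) - x0)^2))   # neighbors of x0, nearest first
  for (j in seq_along(k_grid))
    pred_mat[s, j] <- mean(y[ord[1:k_grid[j]]])
}

bias2 <- (colMeans(pred_mat) - true_y)^2   # squared bias across runs
vars  <- apply(pred_mat, 2, var)           # variance across runs

matplot(k_grid, cbind(bias2, vars, bias2 + vars), type = "l", lty = 1,
        col = c("blue", "red", "black"), xlab = "k", ylab = "error")
legend("topleft", c("bias^2", "variance", "total"),
       col = c("blue", "red", "black"), lty = 1)
```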

Question 3: [25 pts] Prediction Error Comparison

Let’s compare the performance of your KNN and Lasso. We will inherit most of the settings from Question 1 (training and testing sample sizes and the model). However, make the following changes:

Compare the performance of KNN and Lasso in these two settings. Use the glmnet package to fit the Lasso model, with cross-validation to select lambda.min. For your KNN, use k = 5. Report the prediction MSE on the testing data for both methods in both settings. You only need to perform this once for each setting; no repeated simulations are needed. Comment on your findings.
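For one setting, the comparison might look like the sketch below (the specific modifications for each setting still need to be applied). The data generation and `myknn` follow Question 1; `cv.glmnet` and `lambda.min` are the standard glmnet workflow:

```r
library(glmnet)
set.seed(3)  # illustrative seed

# Same model as Question 1; each setting's modifications go here
gen_data <- function(n) {
  X <- matrix(rnorm(n * 3), n, 3)
  list(X = X, y = 0.5 * X[, 1] + sin(X[, 2]) - 0.3 * X[, 3]^2 + rnorm(n, sd = 0.1))
}
train <- gen_data(400)
test  <- gen_data(1000)

# KNN regression as in Question 1 (illustrative name)
myknn <- function(Xtr, ytr, Xte, k)
  apply(Xte, 1, function(x) mean(ytr[order(colSums((t(Xtr) - x)^2))[1:k]]))

# Lasso: cross-validation to select lambda, predict at lambda.min
cv_fit    <- cv.glmnet(train$X, train$y)
lasso_mse <- mean((predict(cv_fit, newx = test$X, s = "lambda.min") - test$y)^2)

# KNN with k = 5
knn_mse <- mean((myknn(train$X, train$y, test$X, k = 5) - test$y)^2)

c(Lasso = lasso_mse, KNN = knn_mse)  # testing MSE for both methods
```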

Question 4: [30 pts] MNIST with KNN Multi-class Classification

The MNIST dataset of handwritten digits is one of the most popular image datasets from the early days of machine learning. Many machine learning algorithms have pushed the accuracy on this dataset to over 99%. We will use the first 2000 observations, which you can download from our course website. The first column contains the digit labels. This is a fairly balanced dataset.

  load("mnist_first2000.RData")
  dim(mnist)
## [1] 2000  785

Modify your KNN code for classification with multi-class labels. You already did this for regression in Question 1; now you need to output the majority label among the k nearest neighbors. For ties, return a randomly selected label among the tied labels. Apply this function to predict the labels of the 501st–2000th observations, using the first 500 observations as the training data. Use \(k = 10\) and report the prediction error as a contingency table. Which digits are more likely to be misclassified?
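The majority-vote modification with random tie-breaking could be sketched as below. The function name `knn_class` is illustrative; the guard around `sample()` matters because `sample(x, 1)` on a single *numeric* value would draw from `1:x` instead of returning `x`:

```r
set.seed(4)  # illustrative seed

# Multi-class KNN: take the majority label among the k nearest neighbors,
# breaking ties by a random draw among the tied labels
knn_class <- function(Xtrain, ytrain, Xtest, k) {
  apply(Xtest, 1, function(x) {
    d   <- colSums((t(Xtrain) - x)^2)     # squared Euclidean distances
    tab <- table(ytrain[order(d)[1:k]])   # label counts among the k neighbors
    top <- names(tab)[tab == max(tab)]    # all labels tied for the majority
    if (length(top) > 1) sample(top, 1) else top
  })
}

# Application to MNIST (after load("mnist_first2000.RData"); label in column 1):
#   pred <- knn_class(as.matrix(mnist[1:500, -1]), mnist[1:500, 1],
#                     as.matrix(mnist[501:2000, -1]), k = 10)
#   table(Predicted = pred, True = mnist[501:2000, 1])   # contingency table

# Quick sanity check on toy data: two well-separated clusters
Xtr <- rbind(matrix(rnorm(50, 0), 25, 2), matrix(rnorm(50, 5), 25, 2))
ytr <- rep(c("a", "b"), each = 25)
knn_class(Xtr, ytr, rbind(c(0, 0), c(5, 5)), k = 3)
```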