Instruction

Please remove this section when submitting your homework.

Students are encouraged to work together on homework and/or utilize advanced AI tools. However, sharing, copying, or providing any part of a homework solution or code to others is an infraction of the University’s rules on Academic Integrity. Any violation will be punished as severely as possible. Final submissions must be uploaded to Gradescope. No email or hard copy will be accepted. For late submission policy and grading rubrics, please refer to the course website.

Question 1: Handwritten Digits

The MNIST dataset of handwritten digits was one of the most popular image datasets in the early days of machine learning development. Many machine learning algorithms have pushed the accuracy on this dataset to over 99%. We will download the first 2000 observations of this dataset from an online resource using the following code. The first column contains the digit labels. This is a fairly balanced dataset.

  # read in the data
  mnist <- read.csv("https://pjreddie.com/media/files/mnist_train.csv", nrows = 2000)
  colnames(mnist) <- c("Digit", paste("Pixel", 1:784, sep = ""))
  save(mnist, file = "mnist_first2000.RData")

  # you can load the data with the following code
  # load("mnist_first2000.RData")
  dim(mnist)
## [1] 2000  785
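
The balance claim is easy to check with a quick tabulation of the label column:

  # counts per digit label
  table(mnist$Digit)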
  1. [20 pts] The first question is to write your own KNN model. Please review our lecture notes on how the KNN model works and complete the following tasks:

    • Since the KNN model requires calculating the distance between the testing sample and all training samples, you first need to write a function with the syntax mydist(x0, trainx) to calculate the Euclidean distance between a target covariate vector x0 and a data matrix trainx. You are not allowed to use any existing R functions that calculate the distance. The function should return a vector of length \(n\), where \(n\) is the number of rows in trainx.
    • Write a function with the syntax myknn(x0, trainx, trainy, k) to find the closest k neighbors of x0. The function should return a vector of length k consisting of the class labels of these neighbors (a minimal skeleton of both functions is sketched after this list). The function should take the following arguments:
      • x0: a covariate vector for the target point
      • trainx: a data matrix of training covariates
      • trainy: a vector of training labels
      • k: the number of neighbors to use
    • Apply these functions to predict the label of the 101st observation, using the first 100 observations as the training data. Use \(k = 10\).
    • Based on the result, what should be the predicted label? Is it a correct prediction?
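
    One possible skeleton is sketched below (the implementation details are yours to verify). It uses only elementary vectorized operations, so no built-in distance function is involved:

      # Euclidean distance between a target vector x0 and every row of trainx
      mydist <- function(x0, trainx) {
        trainx <- as.matrix(trainx)
        # subtract x0 from each row, square, sum across columns, take the root
        sqrt(rowSums(sweep(trainx, 2, x0)^2))
      }

      # labels of the k closest training points to x0
      myknn <- function(x0, trainx, trainy, k) {
        d <- mydist(x0, trainx)
        trainy[order(d)[1:k]]
      }

      # example call: predict the 101st observation from the first 100
      nb <- myknn(unlist(mnist[101, -1]), mnist[1:100, -1], mnist$Digit[1:100], k = 10)
      table(nb)  # a majority vote among the neighbors gives the predicted label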
  2. [15 pts] Your code might be too slow. Let’s switch to an existing package for these tasks. Use the caret package to fit a KNN classification model. Use the first 1000 observations as the training data to tune parameters and fit the model.

    • Consider using AI tools to give you sample code for using the caret package and its train() function (one possible template is given after this list)
    • You should consider a range of k values. Make your own choice and use cross-validation to select the best k. What is the criterion you use for this selection?
    • Once you have your final model, use the remaining 1000 observations as the testing data and report the prediction confusion matrix of the best model you selected
    • What is the prediction classification error?
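
    As a starting point, here is one possible template (a sketch only; the fold count and the k grid are arbitrary choices that you should make and justify yourself):

      library(caret)

      trainx <- mnist[1:1000, -1]
      trainy <- factor(mnist$Digit[1:1000])
      testx  <- mnist[1001:2000, -1]
      testy  <- factor(mnist$Digit[1001:2000], levels = levels(trainy))

      # 5-fold cross-validation; accuracy is caret's default selection criterion
      ctrl <- trainControl(method = "cv", number = 5)
      knn.fit <- train(x = trainx, y = trainy, method = "knn", trControl = ctrl,
                       tuneGrid = data.frame(k = seq(1, 21, by = 2)))

      pred <- predict(knn.fit, newdata = testx)
      confusionMatrix(pred, testy)   # confusion matrix on the test set
      mean(pred != testy)            # prediction classification error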
  3. [15 pts] Now let’s try to use (multi-class) logistic regression to fit the data. Use the first 1000 observations as the training data and the rest as the testing data.

    • Use the glmnet package to fit a multi-class logistic regression model with Lasso penalty. Use cross-validation to select the best tuning parameter.
    • Consider using AI tools to give you sample code to properly fit a multi-class logistic regression (one possible template is given after this list)
    • Report the prediction confusion matrix of the best model you selected
    • What is the prediction classification error?
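
    One possible template (a sketch; cv.glmnet defaults to 10-fold cross-validation, and type.measure = "class" uses misclassification error as the selection criterion):

      library(glmnet)

      xtrain <- as.matrix(mnist[1:1000, -1])
      ytrain <- factor(mnist$Digit[1:1000])
      xtest  <- as.matrix(mnist[1001:2000, -1])
      ytest  <- factor(mnist$Digit[1001:2000])

      # multinomial logistic regression with Lasso penalty (alpha = 1)
      cv.fit <- cv.glmnet(xtrain, ytrain, family = "multinomial",
                          type.measure = "class", alpha = 1)

      pred <- predict(cv.fit, newx = xtest, s = "lambda.min", type = "class")
      table(Predicted = pred, Actual = ytest)  # confusion matrix
      mean(pred != ytest)                      # prediction classification error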

Question 2: Intrinsic Low Dimension

For this question, let's set up a simulation study. We will consider a setting with a latent structure. You should generate \(n = 400\) observations with \(p = 100\) variables. Use the first 200 as the training data and the rest as the testing data. Perform the following steps:

Hence, the expected outcome depends only on the first two latent variables. The goal of this experiment is to observe how the KNN model can be affected by the dimensionality of the latent space. Keep in mind that you do not observe \(Z\) in the real world; you can only observe \(X\) and \(\mathbf{y}\). Perform the following:
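
Because the generation steps are not reproduced above, the sketch below shows one construction consistent with this description; the distributions of \(Z\) and \(A\) and the exact form of the mean function are illustrative assumptions, not the assignment's specification:

  # ASSUMED data-generating process -- one construction consistent with the text
  set.seed(1)
  n <- 400; p <- 100; m <- 3        # latent dimension m: use 3 or 30

  Z <- matrix(rnorm(n * m), n, m)   # latent variables (unobserved in practice)
  A <- matrix(rnorm(m * p), m, p)   # random map from latent to observed space
  X <- Z %*% A                      # observed covariates (n x p)
  y <- Z[, 1] + Z[, 2] + rnorm(n)   # depends only on the first two latents (assumed form)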

  1. [25 pts] Fit a KNN regression using the generated data with \(m = 3\) and predict the corresponding testing data. Vary \(k\) over the grid seq(2, 82, 8). What are the testing errors? Repeat the experiment with \(m = 30\) and report the testing errors. (A sketch is given below.)
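
    A sketch of this step, assuming the objects X and y generated above and using the FNN package as one possible KNN regression tool (the package choice is not specified by the assignment):

      library(FNN)  # one of several packages offering KNN regression

      train.id <- 1:200; test.id <- 201:400
      k.grid <- seq(2, 82, 8)

      # mean squared testing error for each k
      test.err <- sapply(k.grid, function(k) {
        fit <- knn.reg(train = X[train.id, ], test = X[test.id, ],
                       y = y[train.id], k = k)
        mean((fit$pred - y[test.id])^2)
      })
      test.err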

  2. [15 pts] Now let's perform a simulation study that repeats 50 times; you should re-generate \(A\) each time. Average your prediction errors (for each \(k\)) over the simulation runs, just as in the simulations from previous homework. At the end, show a plot that summarizes and compares the prediction errors of the two settings: use \(k\) as the horizontal axis and prediction error as the vertical axis, with two lines, each representing one setting of \(m\). (A sketch of the simulation loop is given below.)
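
    A sketch of the simulation loop, reusing the assumed data-generating code and the objects n, p, k.grid, train.id, and test.id defined above:

      # one simulation run for a given latent dimension m; returns errors over k.grid
      run.once <- function(m) {
        Z <- matrix(rnorm(n * m), n, m)
        A <- matrix(rnorm(m * p), m, p)   # A is re-generated on every call
        X <- Z %*% A
        y <- Z[, 1] + Z[, 2] + rnorm(n)   # assumed mean function, as before
        sapply(k.grid, function(k) {
          fit <- knn.reg(train = X[train.id, ], test = X[test.id, ],
                         y = y[train.id], k = k)
          mean((fit$pred - y[test.id])^2)
        })
      }

      nsim <- 50
      err.m3  <- rowMeans(replicate(nsim, run.once(3)))   # average over runs
      err.m30 <- rowMeans(replicate(nsim, run.once(30)))

      matplot(k.grid, cbind(err.m3, err.m30), type = "l", lty = 1, col = 1:2,
              xlab = "k", ylab = "averaged testing error")
      legend("topright", legend = c("m = 3", "m = 30"), lty = 1, col = 1:2)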

  3. [10 pts] In both settings, we are still using 100 variables to fit the KNN model, yet the performances are very different. Can you comment on the results? In which setting is it easier for KNN to obtain a better fit? Why is that?