Instruction

Please remove this section when submitting your homework.

Students are encouraged to work together on homework and/or utilize advanced AI tools. However, there are two basic rules:

Final submissions must be uploaded to Gradescope. No email or hard copy will be accepted. Please refer to the course website for late submission policy and grading rubrics.

Question 1: [20 pts] KNN Implementation and Simulation

Questions 1 and 2 serve the purpose of comparing KNN with linear regression models in different settings. But first of all, let’s write a KNN function ourselves and compare it with some existing packages. Here is the data-generating model that we will consider:

\[ Y_i = 0.5 X_{i1} + \sin(X_{i2}) - 0.3 X_{i3}^2 + \epsilon_i, \quad i = 1, \ldots, n \]

For each observation, we generate \(X_{i1}, X_{i2}, X_{i3}\) as i.i.d. standard normal, and the \(\epsilon_i\) are i.i.d. \(\mathcal{N}(0, 0.1^2)\). We will use \(n = 400\) observations as the training data and another 1000 observations as the testing data. Follow the steps below:

Answer:

  # set the seed
  set.seed(432)

  # generate the training data
  ntrain = 400
  ntest = 1000
  n = ntrain + ntest
  p = 3
  
  X = matrix(rnorm(n*p), nrow = n, ncol = p)
  Y = 0.5*X[, 1] + sin(X[, 2]) - 0.3*X[, 3]^2 + rnorm(n, 0, 0.1)
  trainx = X[1:ntrain, ]
  trainy = Y[1:ntrain]
  testx = X[(ntrain+1):n, ]
  testy = Y[(ntrain+1):n]
  
  # define the myknn(trainx, trainy, testx, k) function
  myknn <- function(trainx, trainy, testx, k) {
    ntest = nrow(testx)
    pred = rep(NA, ntest)
    
    for (i in 1:ntest) {
      # squared Euclidean distance between the i-th test observation and all training observations
      dist = rowSums(sweep(trainx, 2, testx[i, ], "-")^2)
      
      # find the k nearest neighbors; the squared distance gives the same ranking as the Euclidean distance
      knn = order(dist)[1:k]
      
      # return the average of their y values as the prediction
      pred[i] = mean(trainy[knn])
    }
    
    return(pred)
  }
  
  # fit the model and predict the testing data
  allk = seq(1, 10, 1)
  allerrors = rep(NA, length(allk))
  for (j in 1:length(allk)) {
    pred = myknn(trainx, trainy, testx, allk[j])
    allerrors[j] = mean((pred - testy)^2)
  }
  
  # plot k vs. prediction MSE
  plot(allk, allerrors, type = "b", xlab = "k", ylab = "Prediction MSE", main = "k vs. Prediction MSE")

  # find the optimal k
  optimal_k = allk[which.min(allerrors)]
  optimal_k
## [1] 3
  pred_optimal = myknn(trainx, trainy, testx, optimal_k)
  
  # scatter plot of the predicted values vs. the true values
  plot(testy, pred_optimal, xlab = "True Values", ylab = "Predicted Values", main = paste("Predicted vs. True Values (k =", optimal_k, ")"))
  abline(0, 1, col = "red")
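
The question also suggests comparing our implementation with some existing packages. Below is a minimal sketch of such a check, assuming the FNN package is installed; any small discrepancy between the two sets of predictions can only come from how ties among equidistant neighbors are broken.

  # sanity check against an existing implementation (assumes the FNN package is available)
  library(FNN)
  pred_fnn = knn.reg(train = trainx, test = testx, y = trainy, k = optimal_k)$pred

  # the predictions should agree up to tie-breaking among equidistant neighbors
  max(abs(pred_fnn - pred_optimal))
  mean((pred_fnn - testy)^2)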

Question 2: [25 pts] Bias and Variance of KNN

Similar to our previous homework, we can use a simulation study to understand the bias and variance of KNN. To keep this simple, we will consider just one testing point at \((0.5, 0.7, 1)\). Keep in mind that, based on our understanding of the bias-variance trade-off, we need to perform repeated simulations to get estimates of the outcome at this point, and then calculate the bias and variance across all simulations. Use the same model setting and training data size as in Question 1. You should be able to figure out the procedure of this simulation from the derivation we did in class. After completing the simulation, plot the bias, variance, and total error (bias\(^2\) + variance) against k = seq(1, 29, 2) in a single plot. Comment on the findings.

Answer:
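
The simulation estimates, at the fixed test point \(x_0 = (0.5, 0.7, 1)\), the pointwise error decomposition that the question refers to:

\[ \mathrm{E}\big[(\hat{f}(x_0) - f(x_0))^2\big] = \big(\mathrm{E}[\hat{f}(x_0)] - f(x_0)\big)^2 + \mathrm{Var}\big(\hat{f}(x_0)\big) = \text{bias}^2 + \text{variance}, \]

where the expectation is over repeated draws of the training data, \(\hat{f}\) is the KNN fit, and \(f(x_0) = 0.5 \times 0.5 + \sin(0.7) - 0.3 \times 1^2\) is the true mean function. The irreducible noise variance \(\sigma^2 = 0.01\) is not part of the "total error" defined in the question, so it is omitted below.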

  # set the seed
  set.seed(432)

  # define a new knn function for this question
  myknn_multi <- function(trainx, trainy, testx, allk) {

    pred = rep(NA, length(allk))
  
    # squared Euclidean distance between the test point and all training observations
    dist = rowSums(sweep(trainx, 2, testx, "-")^2)
    orderdist = order(dist)
    
    for (i in 1:length(allk)) {
      # indices of the allk[i] nearest neighbors (the squared distance gives the same ranking)
      knn = orderdist[1:allk[i]]
      
      # return the average of their y values as the prediction
      pred[i] = mean(trainy[knn])
    }
    
    return(pred)
  }
  
  # perform the simulation
  ntrain = 400
  p = 3
  ntest = 1
  nsim = 1000
  testpoint = matrix(c(0.5, 0.7, 1), nrow = 1)

  allk = seq(1, 29, 2)
  allpred = matrix(NA, nrow = nsim, ncol = length(allk))

  for (i in 1:nsim) {
    # generate the training data
    trainx = matrix(rnorm(ntrain*p), nrow = ntrain, ncol = p)
    trainy = 0.5*trainx[, 1] + sin(trainx[, 2]) - 0.3*trainx[, 3]^2 + rnorm(ntrain, 0, 0.1)
    
    # predict the testing point
    allpred[i, ] = myknn_multi(trainx, trainy, testpoint, allk)
  }
  
  # calculate the bias, variance and total error
  true_value = 0.5*testpoint[1] + sin(testpoint[2]) - 0.3*testpoint[3]^2
  bias = colMeans(allpred) - true_value
  variance = apply(allpred, 2, var)
  total_error = bias^2 + variance
  
  # plot the bias, variance and total error
  plot(allk, bias^2, type = "b", col = "blue", xlab = "k", ylab = "Error", ylim = c(0, max(total_error)), main = "Bias, Variance and Total Error vs. k")
  lines(allk, variance, type = "b", col = "red")
  lines(allk, total_error, type = "b", col = "green")
  legend("topright", legend = c("Bias^2", "Variance", "Total Error"), col = c("blue", "red", "green"), lty = 1)

Question 3: [25 pts] Prediction Error Comparison

Let’s compare the performance of your KNN and Lasso. We will inherit most of the settings from Question 1 (training and testing sample sizes and the model). However, make the following changes:

Setting 1: increase the dimension to \(p = 30\), with all covariates i.i.d. standard normal (only the first three enter the true model).

Setting 2: keep \(p = 30\), but generate the covariates from a multivariate normal with pairwise correlation 0.8.

Compare the performance of KNN and Lasso in these two settings. Use the glmnet package to fit the Lasso model, with cross-validation to select lambda.min. For your KNN, use k = 5. Report the prediction MSE on the testing data for both methods in both settings. You only need to perform this once for each setting; no repeated simulations are needed. Comment on your findings.

Answer:

  # set the seed
  set.seed(432)
  library(glmnet)
## Loading required package: Matrix
## Loaded glmnet 4.1-9
  # Setting 1: p = 30 with independent covariates
  ntrain = 400
  ntest = 1000
  n = ntrain + ntest
  p = 30
  
  X1 = matrix(rnorm(n*p), nrow = n, ncol = p)
  Y1 = 0.5*X1[, 1] + sin(X1[, 2]) - 0.3*X1[, 3]^2 + rnorm(n, 0, 0.1)
  
  trainx1 = X1[1:ntrain, ]
  trainy1 = Y1[1:ntrain]
  testx1 = X1[(ntrain+1):n, ]
  testy1 = Y1[(ntrain+1):n]
  
  # fit a Lasso model
  lasso1 <- cv.glmnet(as.matrix(trainx1), trainy1, alpha = 1, nfolds = 5)
  
  # predict the testing data using Lasso
  pred_lasso1 <- predict(lasso1, newx = as.matrix(testx1), s = "lambda.min")
  
  # calculate the prediction MSE for Lasso
  mse_lasso1 <- mean((pred_lasso1 - testy1)^2)
  
  # predict the testing data using KNN with k = 5
  pred_knn1 <- myknn(trainx1, trainy1, testx1, k = 5)
  
  # calculate the prediction MSE for KNN
  mse_knn1 <- mean((pred_knn1 - testy1)^2)
  
  # print the results for Setting 1
  cat("Setting 1 (p = 30 with independent covariates):\n
      Lasso MSE:", mse_lasso1, "\n
      KNN MSE:", mse_knn1, "\n\n")
## Setting 1 (p = 30 with independent covariates):
## 
##       Lasso MSE: 0.2399099 
## 
##       KNN MSE: 0.5893325
  # Setting 2: p = 30 with correlated covariates
  library(MASS)
  
  # generate the covariance matrix
  Sigma = matrix(0.8, nrow = p, ncol = p)
  diag(Sigma) = 1
  
  X2 = mvrnorm(n, mu = rep(0, p), Sigma = Sigma)
  Y2 = 0.5*X2[, 1] + sin(X2[, 2]) - 0.3*X2[, 3]^2 + rnorm(n, 0, 0.1)
  
  trainx2 = X2[1:ntrain, ]
  trainy2 = Y2[1:ntrain]
  testx2 = X2[(ntrain+1):n, ]
  testy2 = Y2[(ntrain+1):n]
  
  # fit a Lasso model
  lasso2 <- cv.glmnet(as.matrix(trainx2), trainy2, alpha = 1, nfolds = 5)
  
  # predict the testing data using Lasso
  pred_lasso2 <- predict(lasso2, newx = as.matrix(testx2), s = "lambda.min")
  
  # calculate the prediction MSE for Lasso
  mse_lasso2 <- mean((pred_lasso2 - testy2)^2)
  
  # predict the testing data using KNN with k = 5
  pred_knn2 <- myknn(trainx2, trainy2, testx2, k = 5)
  
  # calculate the prediction MSE for KNN
  mse_knn2 <- mean((pred_knn2 - testy2)^2)
  
  # print the results for Setting 2
  cat("Setting 2 (p = 30 with correlated covariates):\n
      Lasso MSE:", mse_lasso2, "\n
      KNN MSE:", mse_knn2, "\n")
## Setting 2 (p = 30 with correlated covariates):
## 
##       Lasso MSE: 0.2338774 
## 
##       KNN MSE: 0.171011
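
Based on the printed MSEs, KNN performs much worse than Lasso in Setting 1 but better in Setting 2. One plausible explanation: with 27 irrelevant, independent covariates, the Euclidean distance used by KNN is dominated by noise dimensions (curse of dimensionality), whereas strong correlation among the covariates lowers the effective dimension of the data, which helps a local averaging method. Lasso is comparatively stable in both settings because it selects a sparse linear fit, although it cannot capture the sine and quadratic terms. To support this comment, one could inspect which covariates Lasso keeps at lambda.min; a brief sketch:

  # nonzero Lasso coefficients at lambda.min in each setting (row 1 is the intercept)
  coef1 = as.matrix(coef(lasso1, s = "lambda.min"))
  coef2 = as.matrix(coef(lasso2, s = "lambda.min"))
  which(coef1[-1, 1] != 0)
  which(coef2[-1, 1] != 0)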

Question 4: [30 pts] MNIST with KNN Multi-class Classification

The MNIST dataset of handwritten digits is one of the most popular image datasets from the early days of machine learning, and many algorithms have pushed the accuracy on it to over 99%. We will use the first 2000 observations of this dataset, which you can download from our course website. The first column contains the digit labels. This is a fairly balanced dataset.

  load("mnist_first2000.RData")
  dim(mnist)
## [1] 2000  785

Modify your KNN code for classification with multi-class labels. You already did this for regression in Question 1; now you need to output the majority label among the k nearest neighbors. For ties, return a randomly selected label among the tied labels. Apply this function to predict the labels of observations 501–2000, using the first 500 observations as the training data. Use \(k = 10\) and report the predictions against the true labels as a contingency table. Which digits are more likely to be misclassified?

  # set the seed
  set.seed(432)

  # define the myknn_classification function
  myknn_class <- function(trainx, trainy, testx, k) {
    ntest = nrow(testx)
    pred = rep(NA, ntest)
    
    for (i in 1:ntest) {
      # squared Euclidean distance between the i-th test observation and all training observations
      dist = rowSums(sweep(trainx, 2, testx[i, ], "-")^2)
      
      # find the k nearest neighbors
      knn = order(dist)[1:k]
      
      # return the majority label among the k nearest neighbors
      knn_labels = trainy[knn]
      label_counts = table(knn_labels)
      max_count = max(label_counts)
      majority_labels = names(label_counts[label_counts == max_count])
      
      if (length(majority_labels) > 1) {
        # tie case: randomly select one of the tied labels
        pred[i] = sample(majority_labels, 1)
      } else {
        pred[i] = majority_labels
      }
    }
    
    return(pred)
  }
  
  # split the data into training and testing sets
  trainx_mnist = data.matrix(mnist[1:500, -1])
  trainy_mnist = data.matrix(mnist[1:500, 1])
  testx_mnist = data.matrix(mnist[501:2000, -1])
  testy_mnist = data.matrix(mnist[501:2000, 1])
  
  # predict the testing data using KNN with k = 10
  pred_mnist = myknn_class(trainx_mnist, trainy_mnist, testx_mnist, k = 10)
  
  # create a contingency table of true vs. predicted labels
  contingency_table = table(True = testy_mnist, Predicted = pred_mnist)
  print(contingency_table)
##     Predicted
## True   0   1   2   3   4   5   6   7   8   9
##    0 124   1   3   1   0   8   3   0   1   0
##    1   0 154   0   0   0   0   0   0   0   0
##    2   2  31  84   1   7   3   3   9   4   2
##    3   2  11   1 106   0  10   0   1   4   5
##    4   0  12   0   0 118   1   2   0   0  29
##    5   3   5   1  24   1  92   6   1   0   9
##    6   1  17   1   0   4   4 128   0   0   0
##    7   0  13   0   0   4   1   0 136   0  18
##    8   1  12   0  15   2   2   3   1  81  16
##    9   2   4   1   3   5   0   2   3   0 135
  # calculate the prediction accuracy
  accuracy = sum(diag(contingency_table)) / sum(contingency_table)
  cat("Prediction Accuracy:", accuracy, "\n")
## Prediction Accuracy: 0.772
  # identify which digits are more likely to be misclassified: find the digit with the
  # lowest per-class accuracy (from the table above, 2 is most often misclassified, frequently as 1)
  as.numeric(which.min(diag(contingency_table) / rowSums(contingency_table)) - 1)
## [1] 2
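
To get the full ranking of which digits are hardest to classify, one can also sort the per-digit error rates from the contingency table; a short sketch:

  # per-digit misclassification rate, sorted from worst to best
  per_digit_error = 1 - diag(contingency_table) / rowSums(contingency_table)
  sort(per_digit_error, decreasing = TRUE)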