Please remove this section when submitting your homework.

Students are encouraged to work together on homework and/or utilize advanced AI tools. However, there are two basic rules:

- Final submissions must be uploaded to Gradescope. No email or hard copy will be accepted. Please refer to the course website for the late submission policy and grading rubrics.
- Name your submission file as HWx_yourNetID.pdf. For example, HW01_rqzhu.pdf. Please note that this must be a .pdf file; .html files will not be accepted because they are often not readable on Gradescope.
- Make all of your R code chunks visible for grading.
- Make sure your R version is \(\geq 4.0.0\). This will ensure your random seed generation is the same as everyone else's.
- If you use this .Rmd file as a template, be sure to remove this instruction section.
The goal of Questions 1 and 2 is to compare KNN with linear regression models in different settings. But first of all, let’s write a KNN function ourselves and compare it with some existing packages. Here is the data generating model that we will consider:
\[ Y_i = 0.5 X_{i1} + \sin(X_{i2}) - 0.3 X_{i3}^2 + \epsilon_i, \quad i = 1, \ldots, n \]
For each observation \(i\), we generate \(X_{i1}, X_{i2}, X_{i3}\) as i.i.d. standard normal variables, and the \(\epsilon_i\) are i.i.d. \(\mathcal{N}(0, 0.1^2)\). We will consider \(n = 400\) observations as the training data and another 1000 observations as the testing data. Follow the steps below:
- Write a function myknn(trainx, trainy, testx, k) to perform KNN regression. You can refer to the lecture notes for the details of how KNN works. You are not allowed to use any existing KNN functions in R to complete this task, but you may use other functions to help with the calculation. More specifically, the function should do the following:
  - For each testing observation, find its k nearest neighbors in the training data and return the average of their y values as the prediction.
  - You may use a for loop in the function.
- Tune k over the grid seq(1, 10, 1).
- Plot k vs. the prediction mean squared error (MSE) on the testing data. For your optimal k, provide a scatter plot of the predicted values vs. the true values on the testing data.
- Does k affect the prediction error, and how?

Answer:
# set the seed
set.seed(432)
# generate the training data
ntrain = 400
ntest = 1000
n = ntrain + ntest
p = 3
X = matrix(rnorm(n*p), nrow = n, ncol = p)
Y = 0.5*X[, 1] + sin(X[, 2]) - 0.3*X[, 3]^2 + rnorm(n, 0, 0.1)
trainx = X[1:ntrain, ]
trainy = Y[1:ntrain]
testx = X[(ntrain+1):n, ]
testy = Y[(ntrain+1):n]
# define the myknn(trainx, trainy, testx, k) function
myknn <- function(trainx, trainy, testx, k) {
ntest = nrow(testx)
pred = rep(NA, ntest)
for (i in 1:ntest) {
# calculate the distance between the i-th test observation and all training observations
dist = rowSums(sweep(trainx, 2, testx[i, ], "-")^2)
# find the k nearest neighbors; I use the squared distance here since the ranking is the same as with Euclidean distance
knn = order(dist)[1:k]
# return the average of their y values as the prediction
pred[i] = mean(trainy[knn])
}
return(pred)
}
# fit the model and predict the testing data
allk = seq(1, 10, 1)
allerrors = rep(NA, length(allk))
for (j in 1:length(allk)) {
pred = myknn(trainx, trainy, testx, allk[j])
allerrors[j] = mean((pred - testy)^2)
}
# plot k vs. prediction MSE
plot(allk, allerrors, type = "b", xlab = "k", ylab = "Prediction MSE", main = "k vs. Prediction MSE")
# find the optimal k
optimal_k = allk[which.min(allerrors)]
optimal_k
## [1] 3
pred_optimal = myknn(trainx, trainy, testx, optimal_k)
# scatter plot of the predicted values vs. the true values
plot(testy, pred_optimal, xlab = "True Values", ylab = "Predicted Values", main = paste("Predicted vs. True Values (k =", optimal_k, ")"))
abline(0, 1, col = "red")
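Since the introduction mentions comparing our function with existing packages, here is an optional sanity check, written as a minimal sketch assuming the FNN package is installed. FNN::knn.reg performs the same KNN regression, so its predictions should essentially match those of myknn, up to tie-breaking among equidistant neighbors.

# optional sanity check: compare myknn with FNN::knn.reg (assumes FNN is installed)
library(FNN)
pred_fnn = knn.reg(train = trainx, test = testx, y = trainy, k = optimal_k)$pred
# the largest absolute difference should be essentially zero
max(abs(pred_fnn - pred_optimal))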
Similar to our previous homework, we can use a simulation study to
understand the bias and variance of KNN. To keep this simple, we will
consider just one testing point at \((0.5,
0.7, 1)\). Keep in mind that based on our understanding of the
bias-variance trade-off, we need to perform repeated simulations to get
estimates of the outcome at this point, and then calculate the bias and
variance across all simulations. Use the same model setting and training
data size as in Question 1. You should try to figure out the procedure
of this simulation based on the derivation we did in class. After
completing the simulation, you should be able to plot the bias, variance
and total error (bias\(^2\) + variance)
vs. k = seq(1, 29, 2)
in a single plot. Comment on the
findings.
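For reference, the quantity estimated by this simulation at the fixed test point \(x_0 = (0.5, 0.7, 1)\) is the standard decomposition below, where \(f\) denotes the true mean function and \(\hat f\) the KNN fit from one simulated training set:

\[
E\big[ (\hat f(x_0) - f(x_0))^2 \big] = \underbrace{\big( E[\hat f(x_0)] - f(x_0) \big)^2}_{\text{bias}^2} + \underbrace{\mathrm{Var}\big( \hat f(x_0) \big)}_{\text{variance}},
\]

where the expectation is taken over repeated draws of the training data. The simulation approximates it by regenerating the training data many times and recording \(\hat f(x_0)\) for each \(k\).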
Answer:
# set the seed
set.seed(432)
# define a new knn function for this question
myknn_multi <- function(trainx, trainy, testx, allk) {
pred = rep(NA, length(allk))
# calculate the distance between testx and all training observations
dist = rowSums(sweep(trainx, 2, testx, "-")^2)
orderdist = order(dist)
for (i in 1:length(allk)) {
# find the k nearest neighbors; I use the squared distance here since the ranking is the same as with Euclidean distance
knn = orderdist[1: allk[i]]
# return the average of their y values as the prediction
pred[i] = mean(trainy[knn])
}
return(pred)
}
# perform the simulation
ntrain = 400
p = 3
ntest = 1
nsim = 1000
testpoint = matrix(c(0.5, 0.7, 1), nrow = 1)
allk = seq(1, 29, 2)
allpred = matrix(NA, nrow = nsim, ncol = length(allk))
for (i in 1:nsim) {
# generate the training data
trainx = matrix(rnorm(ntrain*p), nrow = ntrain, ncol = p)
trainy = 0.5*trainx[, 1] + sin(trainx[, 2]) - 0.3*trainx[, 3]^2 + rnorm(ntrain, 0, 0.1)
# predict the testing point
allpred[i, ] = myknn_multi(trainx, trainy, testpoint, allk)
}
# calculate the bias, variance and total error
true_value = 0.5*testpoint[1] + sin(testpoint[2]) - 0.3*testpoint[3]^2
bias = colMeans(allpred) - true_value
variance = apply(allpred, 2, var)
total_error = bias^2 + variance
# plot the bias, variance and total error
plot(allk, bias^2, type = "b", col = "blue", xlab = "k", ylab = "Error", ylim = c(0, max(total_error)), main = "Bias, Variance and Total Error vs. k")
lines(allk, variance, type = "b", col = "red")
lines(allk, total_error, type = "b", col = "green")
legend("topright", legend = c("Bias^2", "Variance", "Total Error"), col = c("blue", "red", "green"), lty = 1)
Let’s compare the performance of your KNN and Lasso. We will inherit most of the settings from Question 1 (training and testing sample sizes and the model). However, make the following changes:

- Setting 1: increase the dimension to p = 30, with all covariates generated as independent standard normals (the covariates beyond the first three do not enter the model).
- Setting 2: use p = 30 covariates generated from a multivariate normal distribution with pairwise correlation 0.8.

Compare the performance of KNN and Lasso in these two settings. You should use the glmnet package to fit a Lasso model. Use cross-validation to get lambda.min. For your KNN, use k = 5. You should report the prediction MSE on the testing data for both methods in both settings. You only need to perform this once for each setting; no repeated simulations are needed. Comment on your findings.
Answer:
# set the seed
set.seed(432)
library(glmnet)
## Loading required package: Matrix
## Loaded glmnet 4.1-9
# Setting 1: p = 30 with independent covariates
ntrain = 400
ntest = 1000
n = ntrain + ntest
p = 30
X1 = matrix(rnorm(n*p), nrow = n, ncol = p)
Y1 = 0.5*X1[, 1] + sin(X1[, 2]) - 0.3*X1[, 3]^2 + rnorm(n, 0, 0.1)
trainx1 = X1[1:ntrain, ]
trainy1 = Y1[1:ntrain]
testx1 = X1[(ntrain+1):n, ]
testy1 = Y1[(ntrain+1):n]
# fit a Lasso model
lasso1 <- cv.glmnet(as.matrix(trainx1), trainy1, alpha = 1, nfolds = 5)
# predict the testing data using Lasso
pred_lasso1 <- predict(lasso1, newx = as.matrix(testx1), s = "lambda.min")
# calculate the prediction MSE for Lasso
mse_lasso1 <- mean((pred_lasso1 - testy1)^2)
# predict the testing data using KNN with k = 5
pred_knn1 <- myknn(trainx1, trainy1, testx1, k = 5)
# calculate the prediction MSE for KNN
mse_knn1 <- mean((pred_knn1 - testy1)^2)
# print the results for Setting 1
cat("Setting 1 (p = 30 with independent covariates):\n
Lasso MSE:", mse_lasso1, "\n
KNN MSE:", mse_knn1, "\n\n")
## Setting 1 (p = 30 with independent covariates):
##
## Lasso MSE: 0.2399099
##
## KNN MSE: 0.5893325
# Setting 2: p = 30 with correlated covariates
library(MASS)
# generate the covariance matrix
Sigma = matrix(0.8, nrow = p, ncol = p)
diag(Sigma) = 1
X2 = mvrnorm(n, mu = rep(0, p), Sigma = Sigma)
Y2 = 0.5*X2[, 1] + sin(X2[, 2]) - 0.3*X2[, 3]^2 + rnorm(n, 0, 0.1)
trainx2 = X2[1:ntrain, ]
trainy2 = Y2[1:ntrain]
testx2 = X2[(ntrain+1):n, ]
testy2 = Y2[(ntrain+1):n]
# fit a Lasso model
lasso2 <- cv.glmnet(as.matrix(trainx2), trainy2, alpha = 1, nfolds = 5)
# predict the testing data using Lasso
pred_lasso2 <- predict(lasso2, newx = as.matrix(testx2), s = "lambda.min")
# calculate the prediction MSE for Lasso
mse_lasso2 <- mean((pred_lasso2 - testy2)^2)
# predict the testing data using KNN with k = 5
pred_knn2 <- myknn(trainx2, trainy2, testx2, k = 5)
# calculate the prediction MSE for KNN
mse_knn2 <- mean((pred_knn2 - testy2)^2)
# print the results for Setting 2
cat("Setting 2 (p = 30 with correlated covariates):\n
Lasso MSE:", mse_lasso2, "\n
KNN MSE:", mse_knn2, "\n")
## Setting 2 (p = 30 with correlated covariates):
##
## Lasso MSE: 0.2338774
##
## KNN MSE: 0.171011
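To support the commentary, one could also check how many covariates the Lasso keeps at lambda.min in each setting. This is an optional sketch using the standard coef method for cv.glmnet objects.

# number of nonzero Lasso coefficients (excluding the intercept) at lambda.min
b1 = as.matrix(coef(lasso1, s = "lambda.min"))
b2 = as.matrix(coef(lasso2, s = "lambda.min"))
c(setting1 = sum(b1[-1, ] != 0), setting2 = sum(b2[-1, ] != 0))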
The MNIST dataset of handwritten digits is one of the most popular image datasets from the early days of machine learning. Many machine learning algorithms have pushed the accuracy on this dataset to over 99%. We will use the first 2000 observations of this dataset. You can download it from our course website. The first column contains the digit labels. This is a fairly balanced dataset.
load("mnist_first2000.RData")
dim(mnist)
## [1] 2000 785
Modify your KNN code for classification with multi-class labels. You already did this for regression in Question 1. Now you need to output the majority label among the k nearest neighbors; for ties, return a randomly selected label among the tied labels. Apply this function to predict the labels of the 501st through 2000th observations, using the first 500 observations as the training data. Use \(k = 10\) and report the prediction results as a contingency table. Which digits are more likely to be misclassified?

Answer:
# set the seed
set.seed(432)
# define the myknn_classification function
myknn_class <- function(trainx, trainy, testx, k) {
ntest = nrow(testx)
pred = rep(NA, ntest)
for (i in 1:ntest) {
# calculate the distance between the i-th test observation and all training observations
dist = rowSums(sweep(trainx, 2, testx[i, ], "-")^2)
# find the k nearest neighbors
knn = order(dist)[1:k]
# return the majority label among the k nearest neighbors
knn_labels = trainy[knn]
label_counts = table(knn_labels)
max_count = max(label_counts)
majority_labels = names(label_counts[label_counts == max_count])
if (length(majority_labels) > 1) {
# tie case: randomly select one of the tied labels
pred[i] = sample(majority_labels, 1)
} else {
pred[i] = majority_labels
}
}
return(pred)
}
# split the data into training and testing sets
trainx_mnist = data.matrix(mnist[1:500, -1])
trainy_mnist = data.matrix(mnist[1:500, 1])
testx_mnist = data.matrix(mnist[501:2000, -1])
testy_mnist = data.matrix(mnist[501:2000, 1])
# predict the testing data using KNN with k = 10
pred_mnist = myknn_class(trainx_mnist, trainy_mnist, testx_mnist, k = 10)
# create a contingency table of true vs. predicted labels
contingency_table = table(True = testy_mnist, Predicted = pred_mnist)
print(contingency_table)
## Predicted
## True 0 1 2 3 4 5 6 7 8 9
## 0 124 1 3 1 0 8 3 0 1 0
## 1 0 154 0 0 0 0 0 0 0 0
## 2 2 31 84 1 7 3 3 9 4 2
## 3 2 11 1 106 0 10 0 1 4 5
## 4 0 12 0 0 118 1 2 0 0 29
## 5 3 5 1 24 1 92 6 1 0 9
## 6 1 17 1 0 4 4 128 0 0 0
## 7 0 13 0 0 4 1 0 136 0 18
## 8 1 12 0 15 2 2 3 1 81 16
## 9 2 4 1 3 5 0 2 3 0 135
# calculate the prediction accuracy
accuracy = sum(diag(contingency_table)) / sum(contingency_table)
cat("Prediction Accuracy:", accuracy, "\n")
## Prediction Accuracy: 0.772
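As an optional cross-check (not required), the class package provides a standard KNN classifier; assuming class is installed, its predictions should largely agree with myknn_class, with differences arising only from how distance ties and vote ties are broken.

# optional cross-check against class::knn (assumes the class package is installed)
library(class)
pred_pkg = knn(train = trainx_mnist, test = testx_mnist, cl = factor(trainy_mnist), k = 10)
# proportion of test observations where the two implementations agree
mean(as.character(pred_pkg) == pred_mnist)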
# identify which digits are more likely to be misclassified
# digit 2 has the lowest per-class accuracy and is most often misclassified as 1
as.numeric(which.min(diag(contingency_table) / rowSums(contingency_table)) - 1)
## [1] 2
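To make the answer to "which digits are more likely to be misclassified" more concrete, the per-digit error rates can be read directly from the contingency table (a small optional helper):

# per-digit misclassification rate, sorted from most to least misclassified
per_digit_error = 1 - diag(contingency_table) / rowSums(contingency_table)
round(sort(per_digit_error, decreasing = TRUE), 3)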