Please remove this section when submitting your homework.

Students are encouraged to work together on homework and/or utilize advanced AI tools. However, sharing, copying, or providing any part of a homework solution or code to others is an infraction of the University's rules on Academic Integrity. Any violation will be punished as severely as possible. Final submissions must be uploaded to Gradescope. No email or hard copy will be accepted. For late submission policy and grading rubrics, please refer to the course website.

Your submission should be named `HWx_yourNetID.pdf`; for example, `HW01_rqzhu.pdf`. Please note that this must be a `.pdf` file; the `.html` format cannot be accepted. Make all of your R code chunks visible for grading. If you use the provided `.Rmd` file as a template, be sure to remove this instruction section. Make sure your R version is \(\geq 4.0.0\); this will ensure your random seed generation is the same as everyone else's. Please note that updating the R version may require you to reinstall all of your packages.

The MNIST dataset of handwritten digits is one of the most popular image datasets from the early days of machine learning. Many machine learning algorithms have pushed the accuracy to over 99% on this dataset. We will download the first 2000 observations of this dataset from an online resource using the following code. The first column contains the digit labels. This is a fairly balanced dataset.
# read in the data
mnist <- read.csv("https://pjreddie.com/media/files/mnist_train.csv", nrows = 2000)
colnames(mnist) = c("Digit", paste0("Pixel", 1:784))
save(mnist, file = "mnist_first2000.RData")
# you can load the data with the following code
# load("mnist_first2000.RData")
dim(mnist)
## [1] 2000 785
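Since each row after the first column stores a flattened 28 × 28 image, it can be helpful to visualize a digit before modeling. The sketch below is our own addition (the helper names `digit_matrix` and `plot_digit` are not part of the assignment), assuming the pixel values are stored row by row as in the MNIST csv:

```r
# reshape a 784-value pixel vector into a 28 x 28 matrix
# (assumes pixels are stored row by row, as in the MNIST csv)
digit_matrix <- function(pixels) {
  matrix(as.numeric(pixels), nrow = 28, ncol = 28, byrow = TRUE)
}

# draw the digit; the transpose and column reversal put the image upright
plot_digit <- function(pixels) {
  m <- digit_matrix(pixels)
  image(t(m)[, 28:1], col = gray(seq(1, 0, length.out = 256)), axes = FALSE)
  invisible(m)
}

# e.g., plot_digit(mnist[1, -1]) would display the first training digit
```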
[20 pts] The first question is to write your own KNN model. Please review our lecture notes on how the KNN model works and complete the following tasks:
Write a function `mydist(x0, trainx)` to calculate the Euclidean distance between a target point's covariate vector `x0` and a data matrix `trainx`. You are not allowed to use any existing R functions that calculate distances. The function should return a vector of length \(n\), where \(n\) is the number of rows in `trainx`.
Write a function `myknn(x0, trainx, trainy, k)` to find the closest `k` neighbors of `x0`. The function should return a vector of length `k` consisting of the class labels of these neighbors. The function should take the following arguments:

- `x0`: a covariate vector of the target point
- `trainx`: a data matrix of training covariates
- `trainy`: a vector of training labels
- `k`: the number of neighbors to use

# define the mydist(x0, trainx) function
mydist <- function(x0, trainx) {
# calculate the distance between x0 and each row of trainx
apply(trainx, 1, function(x) sqrt(sum((x0 - x)^2)))
}
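As a side note, the `apply()` loop above can be slow on large training sets. A vectorized alternative (our own sketch, equivalent in output but not required by the assignment) computes all \(n\) distances with matrix operations:

```r
# vectorized Euclidean distances: subtract x0 from every row of trainx
# at once via sweep(), then take row-wise sums of squares
mydist_fast <- function(x0, trainx) {
  X <- as.matrix(trainx)
  sqrt(rowSums(sweep(X, 2, as.numeric(x0))^2))
}
```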
# define the myknn(x0, trainx, trainy, k) function
myknn <- function(x0, trainx, trainy, k) {
# calculate the distance between x0 and each row of trainx
dist <- mydist(x0, trainx)
# find the k nearest neighbors
knn <- order(dist)[1:k]
# return the class labels of the k nearest neighbors
return(trainy[knn])
}
# find the 10 nearest-neighbor labels of the 101st observation
myknn(mnist[101, -1], mnist[1:100, -1], mnist[1:100, 1], 10)
## [1] 7 7 7 7 9 9 4 7 7 7
# the true label of the 101st observation
mnist[101, 1]
## [1] 7
By majority vote among these 10 neighbor labels, the predicted label is 7, which matches the true label.
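Note that `myknn()` returns the neighbor labels rather than a single prediction. A small helper (our own addition, not required by the assignment) can turn those labels into a predicted class by majority vote:

```r
# majority vote: tabulate the neighbor labels and return the most
# frequent one (ties are broken by which.max, i.e., the first mode)
majority_vote <- function(labels) {
  counts <- table(labels)
  as.numeric(names(counts)[which.max(counts)])
}

majority_vote(c(7, 7, 7, 7, 9, 9, 4, 7, 7, 7))  # returns 7
```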
Use the `caret` package to fit a KNN classification model. Use the first 1000 observations as the training data to tune parameters and fit the model. Load the `caret` package and use its `train()` function with a grid of `k` values of your own choice, and use cross-validation to select the best `k`. What is the criterion you use for this selection?

Here, `train()` selects the `k` with the highest cross-validated classification accuracy, which is `caret`'s default selection criterion for classification.

# load the caret package
library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
# fit a KNN model
knn <- train(as.factor(Digit) ~ ., data = mnist[1:1000, ], method = "knn",
trControl = trainControl(method = "cv", number = 5),
tuneGrid = data.frame(k = c(1:15)))
# predict the testing data
pred <- predict(knn, newdata = mnist[1001:2000, ])
# calculate the confusion matrix
table(pred, mnist[1001:2000, 1])
##
## pred 0 1 2 3 4 5 6 7 8 9
## 0 89 0 0 1 0 1 1 0 2 1
## 1 1 103 4 1 6 0 0 1 1 1
## 2 0 0 86 4 1 0 0 0 0 0
## 3 0 0 0 76 0 6 0 0 1 1
## 4 0 0 1 0 81 0 1 0 1 5
## 5 0 0 0 11 1 76 1 2 3 0
## 6 2 0 0 0 2 3 103 0 3 0
## 7 0 1 5 1 1 0 0 101 0 5
## 8 0 0 3 2 1 1 0 0 70 0
## 9 1 0 0 2 16 2 0 3 4 97
# the prediction classification error
1-sum(diag(table(pred, mnist[1001:2000, 1]))/1000)
## [1] 0.118
Use the `glmnet` package to fit a multi-class logistic regression model with Lasso penalty. Use cross-validation to select the best tuning parameter.

# load the glmnet package
library(glmnet)
## Loading required package: Matrix
## Loaded glmnet 4.1-7
# fit a logistic regression model with Lasso penalty
lasso <- cv.glmnet(as.matrix(mnist[1:1000, -1]), as.factor(mnist[1:1000, 1]),
family = "multinomial", type.measure = "class",
alpha = 1, nfolds = 5)
plot(lasso)
# predict the testing data
pred <- predict(lasso, newx = as.matrix(mnist[1001:2000, -1]),
s = "lambda.min", type = "class")
# calculate the confusion matrix
table(pred, mnist[1001:2000, 1])
##
## pred 0 1 2 3 4 5 6 7 8 9
## 0 84 0 1 0 0 0 2 0 3 4
## 1 0 90 3 1 1 0 2 1 3 0
## 2 0 1 77 12 3 0 4 0 3 0
## 3 1 1 1 64 1 1 0 3 1 1
## 4 1 0 2 0 83 1 1 2 1 11
## 5 4 0 0 9 2 75 2 0 2 1
## 6 0 1 6 2 2 2 93 1 1 0
## 7 0 1 5 2 3 2 0 92 2 5
## 8 3 10 3 5 2 7 2 1 66 2
## 9 0 0 1 3 12 1 0 7 3 86
# the prediction classification error
1-sum(diag(table(pred, mnist[1001:2000, 1]))/1000)
## [1] 0.19
For this question, let’s set up a simulation study. We will consider a setting with some latent structures. You should generate \(n = 400\) observations with \(p = 100\). Use the first 200 as the training data and the rest as testing data. Perform the following steps:

- Generate a loading matrix \(A\) of size \(p \times m\) with entries drawn uniformly from \([-2, 2]\).
- Generate a latent matrix \(Z\) of size \(n \times m\) with i.i.d. standard normal entries.
- Generate the observed covariates \(X = Z A^\top + \eta\), where \(\eta\) is an \(n \times p\) matrix of i.i.d. standard normal noise.
- Generate the outcome \(\mathbf{y} = Z_1 + Z_2 + \epsilon\), where \(Z_1, Z_2\) are the first two columns of \(Z\) and \(\epsilon\) is standard normal noise.
Hence, the expected outcome depends only on the first two latent variables. The goal of this experiment is to observe how the KNN model is affected by the dimensionality of the latent space. Keep in mind that you do not observe \(Z\) in the real world; you can only observe \(X\) and \(\mathbf{y}\). Perform the following:
[20 pts] Fit a KNN regression using the generated data with \(m = 3\), and predict the corresponding
testing data. Vary \(k\) in a grid
seq(2, 82, 8)
. What are the testing errors? Repeat this
experiment with \(m = 30\), and report
the testing errors.
[20 pts] Now let’s perform a simulation study that repeats 50 times. You should re-generate \(A\) each time. Average your prediction errors (for each \(k\)) over the simulation runs, just like the simulations we have done in previous HW. At the end, you should show a plot that summarizes and compares the prediction errors of the two settings. For example, use \(k\) as the horizontal axis and prediction error as the vertical axis, with two lines, each representing a setting of \(m\).
[10 pts] In both settings, we are still using 100 variables to fit the KNN model, but the performances are very different. Can you comment on the results? Which setting is easier for KNN to obtain a better fit? Why is that?
# set the seed
set.seed(432)
library(kknn)
##
## Attaching package: 'kknn'
## The following object is masked from 'package:caret':
##
## contr.dummy
# grid of k values to consider
allk = seq(2, 82, 4)
# a function that performs one simulation run for a given m
calculate_errors <- function(m, allk)
{
n = 400
p = 100
# generate the A matrix
A = matrix(runif(p*m, -2, 2), nrow = p, ncol = m)
# generate the Z matrix
Z = matrix(rnorm(n*m), nrow = n, ncol = m)
# generate the eta matrix
eta = matrix(rnorm(n*p), nrow = n, ncol = p)
# generate the X matrix
X = Z %*% t(A) + eta
# generate the Y vector
Y = Z[, 1] + Z[, 2] + rnorm(n)
# split the data into training and testing
traindata = data.frame(Y, X)[1:200, ]
testdata = data.frame(Y, X)[201:400, ]
allerrors = rep(NA, length(allk))
for (j in 1:length(allk))
{
# fit a KNN regression using the kknn package
# (note: kknn's default kernel is "optimal", which weights the neighbors;
# use kernel = "rectangular" for unweighted KNN averaging)
knn = kknn(Y ~ ., traindata, testdata, k = allk[j])
allerrors[j] = mean( (knn$fitted.values - testdata$Y)^2 )
}
return(allerrors)
}
# perform the simulation
nsim = 50
allerrors3 = matrix(NA, nrow = nsim, ncol = length(allk))
allerrors30 = matrix(NA, nrow = nsim, ncol = length(allk))
for (i in 1:nsim)
{
allerrors3[i, ] = calculate_errors(3, allk)
allerrors30[i, ] = calculate_errors(30, allk)
}
plot(allk, colMeans(allerrors3), type = "l", col = "blue",
xlab = "k", ylab = "testing error",
ylim = c(1, 4), main = "m = 3 vs m = 30")
lines(allk, colMeans(allerrors30), col = "red")
legend("topright", legend = c("m = 3", "m = 30"), col = c("blue", "red"), lty = 1)