Please remove this section when submitting your homework.
Students are encouraged to work together on homework and/or utilize advanced AI tools. However, sharing, copying, or providing any part of a homework solution or code to others is an infraction of the University’s rules on Academic Integrity. Any violation will be punished as severely as possible. Final submissions must be uploaded to Gradescope. No email or hard copy will be accepted. For late submission policy and grading rubrics, please refer to the course website.
Name your submission file as HWx_yourNetID.pdf. For example, HW01_rqzhu.pdf. Please note that this must be a .pdf file; a .html file cannot be accepted. Make all of your R code chunks visible for grading. If you use the .Rmd file as a template, be sure to remove this instruction section. Make sure the version of your R is \(\geq 4.0.0\). This will ensure your random seed generation is the same as everyone else's. Please note that updating the R version may require you to reinstall all of your packages.

In our previous homework, we only used the prediction error to evaluate the performance of a model. Now that we have learned how to break down the bias-variance trade-off theoretically, and have seen some simulation ideas to validate it in class, let's perform a thorough investigation. For this question, we will use a simulated regression model to estimate the bias and variance, and then validate our formula. Our simulation is based on the following model:
\[ Y = \exp(\beta^T x) + \epsilon \]
where \(\beta = (0.5, -0.5, 0)^T\), \(X\) is generated uniformly from \([0, 1]^3\), and \(\epsilon\) follows an i.i.d. standard Gaussian distribution. We will generate some training data, and our goal is to predict a testing point at \(x_0 = (1, -0.75, -0.7)^T\).
[1 pt] What is the true mean of \(Y\) at this testing point \(x_0\)? Calculate it in R.
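As a quick check, the true mean \(f(x_0) = \exp(\beta^T x_0)\) is a one-line computation; a minimal sketch:

# the true mean f(x0) = exp(beta^T x0)
beta <- c(0.5, -0.5, 0)
x0 <- c(1, -0.75, -0.7)
true_mean <- exp(sum(beta * x0))
true_mean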
[5 pts] For this question, you need to write your own code for implementing KNN, rather than using any built-in functions in R. Generate 100 training data points and calculate the KNN prediction of \(x_0\) with \(k = 21\). Use the Euclidean distance as the distance metric. What is your prediction? Validate your result with the knn.reg function from the FNN package.
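One possible way to structure this step, shown as a sketch rather than the required solution (the helper name my_knn_reg and the seed are assumptions):

# a sketch of a hand-rolled KNN regression prediction (my_knn_reg is a made-up name)
my_knn_reg <- function(x0, X, y, k) {
  dists <- sqrt(rowSums(sweep(X, 2, x0)^2))  # Euclidean distance to each row
  mean(y[order(dists)[1:k]])                 # average of the k nearest responses
}

set.seed(1)  # an arbitrary seed; the assignment may specify one
n <- 100
X <- matrix(runif(n * 3), n, 3)
y <- drop(exp(X %*% beta)) + rnorm(n)
my_knn_reg(x0, X, y, k = 21)

# cross-check with FNN
library(FNN)
knn.reg(train = X, test = matrix(x0, nrow = 1), y = y, k = 21)$pred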
[5 pts] Now we will estimate the bias of the KNN model for predicting \(x_0\). Use the KNN code you developed in the previous question. To estimate the bias, you need to perform a simulation that repeats 1000 times. Keep in mind that the bias of a model is defined as \(E[\widehat f(x_0)] - f(x_0)\). Use the same sample size \(n = 100\) and the same \(k = 21\), and design your own simulation study to estimate this.
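A sketch of one way to organize the 1000 repetitions, reusing my_knn_reg and the true mean computed earlier:

# repeat the data generation 1000 times and store the prediction at x0
nsim <- 1000
preds <- rep(NA, nsim)
for (i in 1:nsim) {
  X <- matrix(runif(100 * 3), 100, 3)
  y <- drop(exp(X %*% beta)) + rnorm(100)
  preds[i] <- my_knn_reg(x0, X, y, k = 21)
}
bias_hat <- mean(preds) - true_mean  # E[f_hat(x0)] - f(x0)
bias_hat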
[2 pts] Based on your previous simulation, without generating new simulation results, can you estimate the variance of this model? The variance of a model is defined as \(E[(\widehat f(x_0) - E[\widehat f(x_0)])^2]\). Calculate and report the value.
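With the predictions stored from the sketch above, the plug-in variance estimate is one line:

# plug-in estimate of Var(f_hat(x0)) from the stored simulated predictions
var_hat <- mean((preds - mean(preds))^2)
var_hat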
[2 pts] Recall that our prediction error at \(x_0\), \(E[(Y_0 - \widehat f(x_0))^2]\), can be decomposed into the irreducible error, the bias squared, and the variance. Without performing additional simulations, can you calculate each of them based on our model and the previous simulation results? Hence, what is your calculated prediction error?
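Since \(\epsilon\) is standard Gaussian, the irreducible error is \(\text{Var}(\epsilon) = 1\); a sketch of assembling the decomposition from the quantities computed above:

# irreducible error + bias^2 + variance
irreducible <- 1  # Var(epsilon) for standard Gaussian noise
pred_error_theory <- irreducible + bias_hat^2 + var_hat
pred_error_theory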
[5 pts] The last step is to validate this result. To do this, you should generate a testing observation \(Y_0\) using \(x_0\) in each of your simulation runs and calculate the prediction error. Compare this result with your theoretical calculation.
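This can reuse the stored predictions; a sketch, drawing a fresh \(Y_0\) for each run:

# one test observation per simulation run, then the empirical prediction error
y0 <- true_mean + rnorm(nsim)
mean((y0 - preds)^2)  # compare with pred_error_theory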
Load the library ISLR2. From that library, load the dataset named Default. Set the seed to 7 again within the chunk. Divide the dataset into a training and a testing dataset. The test dataset should contain 1000 rows; the remainder should be in the training dataset.
# load library
library(ISLR2)
# load data
data(Default)
# set seed
set.seed(7)
# number of rows in entire dataset
defaultNumRows <- dim(Default)[1]
defaultTestNumRows <- 1000
# separate dataset into train and test
test_idx <- sample(x = 1:defaultNumRows, size = defaultTestNumRows)
Default_train <- Default[-test_idx,]
Default_test <- Default[test_idx,]
[10 pts] Use the glm() function on the training dataset to fit a logistic regression model for the variable default using the input variables balance and income. Write a function called loglikelihood that calculates the log-likelihood for a set of coefficients (you can refer to the lecture notes). There are three input arguments for this function: a vector of coefficients (beta), an input data matrix (X), and input class labels (Y). The output for this function is a numeric, the log-likelihood (output_loglik). Plug in the estimated coefficients from the glm() model, calculate the maximum log-likelihood, and report it. Then, get the deviance value directly from the glm() object output. What is the relationship between the deviance and the maximum log-likelihood?
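A sketch of the requested function, using the logistic regression log-likelihood \(\ell(\beta) = \sum_i \left[ y_i x_i^T \beta - \log(1 + e^{x_i^T \beta}) \right]\) (the intercept column in X and the 0/1 coding of Y are assumptions about how you set up the inputs):

# log-likelihood of logistic regression at a given coefficient vector
loglikelihood <- function(beta, X, Y) {
  eta <- X %*% beta  # linear predictor; X includes an intercept column
  output_loglik <- sum(Y * eta - log(1 + exp(eta)))
  return(output_loglik)
}

# example use with the fitted glm (assumes default is recoded 0/1 in Y)
fit <- glm(default ~ balance + income, data = Default_train, family = binomial)
X <- cbind(1, Default_train$balance, Default_train$income)
Y <- as.numeric(Default_train$default == "Yes")
loglikelihood(coef(fit), X, Y)
fit$deviance  # for this model, deviance = -2 * maximum log-likelihood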
[10 pts] Use the model fit on the training dataset to estimate the probability of default for the test dataset. Use 3 different cutoff values, 0.3, 0.5, and 0.7, to predict classes. For each cutoff value, print the confusion matrix, and calculate and report the test error, sensitivity, specificity, and precision without using any R functions, just the addition/subtraction/multiplication/division operators. Which cutoff value do you prefer in this case? If our goal is to capture as many people who will default as possible (without being concerned about misclassifying people as Default=Yes even if they will not default), which cutoff value should we use?
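For a single cutoff, the arithmetic might look like the following sketch (treating Default = Yes as the positive class; fit refers to the glm model from the sketch above, and sum() is used only to obtain the four counts):

# sketch for one cutoff, e.g. 0.5
phat <- predict(fit, newdata = Default_test, type = "response")
pred <- ifelse(phat > 0.5, "Yes", "No")
table(Predicted = pred, Actual = Default_test$default)  # confusion matrix

# the four cells of the confusion matrix
TP <- sum(pred == "Yes" & Default_test$default == "Yes")
TN <- sum(pred == "No"  & Default_test$default == "No")
FP <- sum(pred == "Yes" & Default_test$default == "No")
FN <- sum(pred == "No"  & Default_test$default == "Yes")

# metrics from the basic arithmetic operators only
test_error  <- (FP + FN) / (TP + TN + FP + FN)
sensitivity <- TP / (TP + FN)
specificity <- TN / (TN + FP)
precision   <- TP / (TP + FP)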
[5 pts] Load the library ROCR. Using the functions in that library, plot the ROC curve and calculate the AUC. Use the ROC curve to determine a cutoff value and comment on your reasoning.
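The usual ROCR workflow builds a prediction object, then performance objects for the curve and the AUC; a sketch reusing phat from the previous sketch:

# ROC curve and AUC with ROCR
library(ROCR)
rocr_pred <- prediction(phat, Default_test$default)
rocr_roc  <- performance(rocr_pred, measure = "tpr", x.measure = "fpr")
plot(rocr_roc)
performance(rocr_pred, measure = "auc")@y.values[[1]]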
[5 pts] Load the library glmnet. Using the cv.glmnet() function, do 20-fold cross-validation on the training dataset to determine the optimal penalty coefficient, \(\lambda\), in the logistic regression with a ridge penalty. In order to choose the best penalty coefficient, use AUC as the cross-validation metric.
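A sketch of the call (alpha = 0 gives the ridge penalty; the construction of the x matrix is an assumption):

# 20-fold CV for ridge logistic regression, selecting lambda by AUC
library(glmnet)
x_train <- as.matrix(Default_train[, c("balance", "income")])
y_train <- Default_train$default
cvfit <- cv.glmnet(x_train, y_train, family = "binomial", alpha = 0,
                   nfolds = 20, type.measure = "auc")
cvfit$lambda.min  # penalty with the best cross-validated AUC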
The MNIST dataset of handwritten digits is one of the most popular image datasets from the early days of machine learning development. Many machine learning algorithms have pushed the accuracy to over 99% on this dataset. The dataset is stored in an online repository in CSV format, https://pjreddie.com/media/files/mnist_train.csv. We will download the first 2500 observations of this dataset from an online resource using the following code. The first column is the digit label; the remaining columns are the pixel values. After we download the dataset, we save it to our local disk so we do not have to re-download the data in the future.
# inputs to download file
fileLocation <- "https://pjreddie.com/media/files/mnist_train.csv"
numRowsToDownload <- 2500
localFileName <- paste0("mnist_first", numRowsToDownload, ".RData")
# download the data and add column names
mnist <- read.csv(fileLocation, nrows = numRowsToDownload)
numColsMnist <- dim(mnist)[2]
colnames(mnist) <- c("Digit", paste0("Pixel", seq_len(numColsMnist - 1)))
# save file
# in the future we can read in from the local copy instead of having to redownload
save(mnist, file = localFileName)
# you can load the data with the following code
load(file = localFileName)
[20 pts] The first task is to write the code to implement the K-Nearest Neighbors, or KNN, model from scratch. We will do this in steps (a consolidated sketch follows the list):
Write a function called euclidean_distance that calculates the Euclidean distance between two vectors. There are two input arguments for this function: vector 1 (vec1) and vector 2 (vec2). The output for this function is a numeric, the Euclidean distance (euclDist).

Write a function called manhattan_distance that calculates the Manhattan distance between two vectors. There are two input arguments for this function: vector 1 (vec1) and vector 2 (vec2). The output for this function is a numeric, the Manhattan distance (manhDist).

Write a function called euclidean_distance_all that calculates the Euclidean distance between a vector and all the row vectors in an input data matrix. There are two input arguments for this function: a vector (vec1) and an input data matrix (mat1_X). The output for this function is a vector (output_euclDistVec) of the same length as the number of rows in mat1_X. This function must use the function euclidean_distance you previously wrote.

Write a function called manhattan_distance_all that calculates the Manhattan distance between a vector and all the row vectors in an input data matrix. There are two input arguments for this function: a vector (vec1) and an input data matrix (mat1_X). The output for this function is a vector (output_manhattanDistVec) of the same length as the number of rows in mat1_X. This function must use the function manhattan_distance you previously wrote.

Write a function called my_KNN that compares a vector to a matrix and finds its K nearest neighbors. There are five input arguments for this function: vector 1 (vec1), the input data matrix (mat1_X), the class labels corresponding to each row of the matrix (mat1_Y), the number of nearest neighbors you are interested in finding (K), and a Boolean argument specifying whether to use the Euclidean distance (euclDistUsed). The argument K should be a positive integer. If euclDistUsed = TRUE, use the Euclidean distance; otherwise, use the Manhattan distance. The output of this function is a list of length 2 (output_knnMajorityVote). The first element of the output list should be a vector of length K containing the class labels of the closest neighbors. The second element should be the majority vote of the K class labels in the first element of the list. This function must use the functions euclidean_distance and manhattan_distance you previously wrote.
Apply this function to predict the label of the \(123^{\text{rd}}\) observation using the first \(100\) observations as your input training data matrix. Use \(K = 10\). What is the predicted label when you use Euclidean distance? What is the predicted label when you use Manhattan distance? Are these predictions correct?
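Below is a consolidated sketch of the five functions, one implementation consistent with the specifications above rather than the only acceptable one (the tie-breaking rule in the majority vote is an assumption):

# Euclidean distance between two vectors
euclidean_distance <- function(vec1, vec2) {
  euclDist <- sqrt(sum((vec1 - vec2)^2))
  return(euclDist)
}

# Manhattan distance between two vectors
manhattan_distance <- function(vec1, vec2) {
  manhDist <- sum(abs(vec1 - vec2))
  return(manhDist)
}

# Euclidean distance from vec1 to every row of mat1_X
euclidean_distance_all <- function(vec1, mat1_X) {
  output_euclDistVec <- apply(mat1_X, 1, euclidean_distance, vec2 = vec1)
  return(output_euclDistVec)
}

# Manhattan distance from vec1 to every row of mat1_X
manhattan_distance_all <- function(vec1, mat1_X) {
  output_manhattanDistVec <- apply(mat1_X, 1, manhattan_distance, vec2 = vec1)
  return(output_manhattanDistVec)
}

# K-nearest neighbors with a majority vote over the K closest labels
my_KNN <- function(vec1, mat1_X, mat1_Y, K, euclDistUsed = TRUE) {
  dists <- if (euclDistUsed) euclidean_distance_all(vec1, mat1_X)
           else manhattan_distance_all(vec1, mat1_X)
  nn_labels <- mat1_Y[order(dists)[1:K]]
  counts <- table(nn_labels)
  majority <- names(counts)[which.max(counts)]  # ties broken by first maximum (an assumption)
  output_knnMajorityVote <- list(nn_labels, majority)
  return(output_knnMajorityVote)
}

# example use: predict observation 123 from the first 100 observations
my_KNN(unlist(mnist[123, -1]), as.matrix(mnist[1:100, -1]),
       mnist[1:100, 1], K = 10, euclDistUsed = TRUE)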
[20 pts] Set the seed to 7 at the beginning of the chunk. Let's now use 20-fold cross-validation to select the best \(K\). Now, load the library caret. We will use the trainControl and train functions from this library to fit a KNN classification model. The \(K\) values we will consider are \(1\), \(5\), \(10\), \(20\), \(50\), and \(100\). Be careful not to get confused between the number of folds and the number of nearest neighbors when using these functions. Use the first \(1250\) observations as the training data to fit each model. Compare the results. What is the best \(K\) according to cross-validation classification accuracy? Once you have chosen \(K\), fit a final KNN model on your entire training dataset with that value. Use that model to predict the classes of the last \(1250\) observations, which form our test dataset. Report the prediction confusion matrix on the test dataset for your final KNN model. Calculate the test error and the sensitivity of each class.
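A sketch of the caret calls; note that the tuning parameter named k in train is the number of neighbors, while the number of folds is set in trainControl (the variable names here are assumptions):

# 20-fold CV over the candidate neighbor counts with caret
set.seed(7)
library(caret)
train_X <- mnist[1:1250, -1]
train_Y <- factor(mnist[1:1250, 1])
ctrl <- trainControl(method = "cv", number = 20)  # 20 folds, not neighbors
knn_cv <- train(x = train_X, y = train_Y, method = "knn",
                tuneGrid = data.frame(k = c(1, 5, 10, 20, 50, 100)),
                trControl = ctrl)
knn_cv$bestTune  # the K selected by CV accuracy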
[10 pts] Set the seed to 7 at the beginning of the chunk. Now let's try to use multi-class (i.e., multinomial) logistic regression to fit the data. Use the first 1250 observations as the training data and the rest as the testing data. Load the library glmnet. We will use a multi-class logistic regression model with a Lasso penalty. First, we seek to find an almost optimal value for the \(\lambda\) penalty parameter. Use the cv.glmnet function with \(20\) folds on the training dataset to find \(\lambda_{1se}\). Once you have identified \(\lambda_{1se}\), use the glmnet() function with that penalty value to fit a multi-class logistic regression model on the entire training dataset. Ensure you set the argument family = "multinomial" within the functions as appropriate. Using that model, predict the class labels for the testing data. Report the testing data prediction confusion matrix. What is the test error?
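A sketch of the glmnet workflow (the variable names are assumptions, and the cross-validation loss is left at cv.glmnet's default):

# lasso-penalized multinomial logistic regression with glmnet
set.seed(7)
library(glmnet)
train_X <- as.matrix(mnist[1:1250, -1])
train_Y <- factor(mnist[1:1250, 1])
test_X  <- as.matrix(mnist[1251:nrow(mnist), -1])
test_Y  <- factor(mnist[1251:nrow(mnist), 1])

cvfit <- cv.glmnet(train_X, train_Y, family = "multinomial",
                   alpha = 1, nfolds = 20)  # alpha = 1 is the Lasso
fit_lasso <- glmnet(train_X, train_Y, family = "multinomial",
                    lambda = cvfit$lambda.1se)
pred <- predict(fit_lasso, newx = test_X, type = "class")
table(Predicted = pred, Actual = test_Y)  # confusion matrix
mean(pred != test_Y)                      # test error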