Instructions

Please remove this section when submitting your homework.

Students are encouraged to work together on homework and/or utilize advanced AI tools. However, sharing, copying, or providing any part of a homework solution or code to others is an infraction of the University’s rules on Academic Integrity. Any violation will be punished as severely as possible. Final submissions must be uploaded to Gradescope. No email or hard copy will be accepted. For late submission policy and grading rubrics, please refer to the course website.

Question 1: Regression KNN and Bias-Variance Trade-Off (20 pts)

In our previous homework, we used only the prediction error to evaluate the performance of a model. We have now learned how to decompose the error into bias and variance theoretically, and we demonstrated some simulation ideas in class to validate that decomposition. Let’s perform a thorough investigation. For this question, we will use a simulated regression model to estimate the bias and variance, and then validate our formula. Our simulation is based on the following model:

\[ Y = \exp(\beta^T x) + \epsilon \]

where \(\beta = c(0.5, -0.5, 0)\), \(X\) is generated uniformly from \([0, 1]^3\), and \(\epsilon\) follows an i.i.d. standard Gaussian distribution. We will generate training data, and our goal is to predict a testing point at \(x_0 = c(1, -0.75, -0.7)\).

  1. [1 pt] What is the true mean of \(Y\) at this testing point \(x_0\)? Calculate it in R.
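
    A minimal sketch in R (since \(E[\epsilon] = 0\), the true mean is \(f(x_0) = \exp(\beta^T x_0)\)):

      # true mean of Y at x0: f(x0) = exp(beta^T x0), since the noise has mean 0
      beta <- c(0.5, -0.5, 0)
      x0 <- c(1, -0.75, -0.7)
      f_x0 <- exp(sum(beta * x0))
      f_x0  # exp(0.875)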

  2. [5 pts] For this question, write your own code to implement KNN rather than using any built-in R functions. Generate 100 training data points and calculate the KNN prediction at \(x_0\) with \(k = 21\), using the Euclidean distance as the distance metric. What is your prediction? Validate your result with the knn.reg function from the FNN package.
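
    One possible sketch, reusing beta and x0 from above (the seed value is our own choice, not specified by the question):

      set.seed(7)  # assumption: any fixed seed for reproducibility
      n <- 100
      k <- 21
      X <- matrix(runif(n * 3), ncol = 3)      # X uniform on [0, 1]^3
      Y <- exp(drop(X %*% beta)) + rnorm(n)    # Y = exp(beta^T x) + epsilon
      
      # Euclidean distance from x0 to every training point,
      # then average the responses of the k nearest neighbors
      dists <- sqrt(rowSums(sweep(X, 2, x0)^2))
      yhat <- mean(Y[order(dists)[1:k]])
      yhat
      
      # validate against the FNN implementation
      library(FNN)
      knn.reg(train = X, test = matrix(x0, nrow = 1), y = Y, k = k)$pred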

  3. [5 pts] Now we will estimate the bias of the KNN model for predicting \(x_0\). Use the KNN code you developed in the previous question. To estimate the bias, perform a simulation that repeats 1000 times. Keep in mind that the bias of a model is defined as \(E[\widehat f(x_0)] - f(x_0)\). Using the same sample size \(n = 100\) and the same \(k = 21\), design your own simulation study to estimate this quantity.
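
    A sketch of one such simulation, repeating the data generation and prediction above and storing each prediction:

      nsim <- 1000
      yhat_sim <- numeric(nsim)
      for (i in 1:nsim) {
        X <- matrix(runif(n * 3), ncol = 3)
        Y <- exp(drop(X %*% beta)) + rnorm(n)
        dists <- sqrt(rowSums(sweep(X, 2, x0)^2))
        yhat_sim[i] <- mean(Y[order(dists)[1:k]])
      }
      bias_hat <- mean(yhat_sim) - f_x0
      bias_hat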

  4. [2 pts] Based on your previous simulation, and without generating new simulation results, can you estimate the variance of this model? The variance of a model is defined as \(E[(\widehat f(x_0) - E[\widehat f(x_0)])^2]\). Calculate and report the value.
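
    Since the 1000 predictions are already stored, the variance can be estimated from them directly:

      # sample analogue of E[(f_hat(x0) - E[f_hat(x0)])^2]
      var_hat <- mean((yhat_sim - mean(yhat_sim))^2)
      var_hat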

  5. [2 pts] Recall that the prediction error of our KNN regression model at \(x_0\) can be decomposed into the irreducible error, the squared bias, and the variance. Without performing additional simulations, can you calculate each of them based on our model and the previous simulation results? What, then, is your calculated prediction error?
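
    As a reminder, the decomposition at \(x_0\) is

    \[ E[(Y_0 - \widehat f(x_0))^2] = \sigma^2 + \big(E[\widehat f(x_0)] - f(x_0)\big)^2 + \mathrm{Var}[\widehat f(x_0)], \]

    where \(\sigma^2 = 1\) for standard Gaussian noise. A sketch using the quantities estimated above:

      # irreducible error + squared bias + variance
      pred_err_theory <- 1 + bias_hat^2 + var_hat
      pred_err_theory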

  6. [5 pts] The last step is to validate this result. To do so, generate a testing observation \(Y_0\) at \(x_0\) in each of your simulation runs and calculate the prediction error. Compare this result with your theoretical calculation.
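
    A sketch of this validation, drawing a fresh test response at \(x_0\) in every run (the seed is again our own choice):

      set.seed(7)  # assumption: any fixed seed
      err_sim <- numeric(nsim)
      for (i in 1:nsim) {
        X <- matrix(runif(n * 3), ncol = 3)
        Y <- exp(drop(X %*% beta)) + rnorm(n)
        dists <- sqrt(rowSums(sweep(X, 2, x0)^2))
        yhat <- mean(Y[order(dists)[1:k]])
        Y0 <- f_x0 + rnorm(1)                # fresh test observation at x0
        err_sim[i] <- (Y0 - yhat)^2
      }
      mean(err_sim)  # should be close to pred_err_theory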

Question 2: Logistic Regression (30 pts)

Load the library ISLR2. From that library, load the dataset named Default. Set the seed to 7 within the chunk. Divide the dataset into a training and a testing dataset: the test dataset should contain 1000 rows, and the remainder should be in the training dataset.

  # load library
  library(ISLR2)
  
  # load data
  data(Default)
  
  # set seed
  set.seed(7)
  
  # number of rows in entire dataset
  defaultNumRows <- dim(Default)[1]
  defaultTestNumRows <- 1000
  
  # separate dataset into train and test
  test_idx <- sample(x = 1:defaultNumRows, size = defaultTestNumRows)
  Default_train <- Default[-test_idx,]
  Default_test <- Default[test_idx,]
  1. [10 pts] Use the glm() function on the training dataset to fit a logistic regression model for the variable default using the input variables balance and income. Write a function called loglikelihood that calculates the log-likelihood for a set of coefficients (you can refer to the lecture notes). There are three input arguments for this function: a vector of coefficients (beta), an input data matrix (X), and input class labels (Y). The output for this function is a numeric, the log-likelihood (output_loglik). Plug the estimated coefficients from the glm() model into your function to calculate the maximum log-likelihood and report it. Then, get the deviance value directly from the glm() object output. What is the relationship between the deviance and the maximum log-likelihood?
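
    One way this could look (a sketch, assuming the design matrix X carries an explicit intercept column so it aligns with the glm() coefficients):

      # fit the logistic regression on the training data
      glm_fit <- glm(default ~ balance + income, data = Default_train,
                     family = binomial)
      
      # log-likelihood for coefficients beta, design matrix X (with an
      # intercept column), and 0/1 labels Y
      loglikelihood <- function(beta, X, Y) {
        eta <- X %*% beta                     # linear predictor
        output_loglik <- sum(Y * eta - log(1 + exp(eta)))
        return(output_loglik)
      }
      
      X_train <- cbind(1, Default_train$balance, Default_train$income)
      Y_train <- as.numeric(Default_train$default == "Yes")
      loglikelihood(coef(glm_fit), X_train, Y_train)  # maximum log-likelihood
      glm_fit$deviance                                # compare with -2 * log-likelihood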

  2. [10 pts] Use the model fit on the training dataset to estimate the probability of default for the test dataset. Use 3 different cutoff values, 0.3, 0.5, and 0.7, to predict classes. For each cutoff value, print the confusion matrix, then calculate and report the test error, sensitivity, specificity, and precision without using any R functions, just the addition/subtraction/multiplication/division operators. Which cutoff value do you prefer in this case? If our goal is to capture as many people who will default as possible (without concern for misclassifying people as Default=Yes even if they will not default), which cutoff value should we use?
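
    A possible sketch: the four counts come from the confusion matrix, and the metrics themselves then use only arithmetic operators:

      # predicted default probabilities on the test set
      p_hat <- predict(glm_fit, newdata = Default_test, type = "response")
      
      for (cutoff in c(0.3, 0.5, 0.7)) {
        pred <- ifelse(p_hat > cutoff, "Yes", "No")
        print(table(Predicted = pred, Actual = Default_test$default))
        # read the four cells off the confusion matrix
        TP <- sum(pred == "Yes" & Default_test$default == "Yes")
        TN <- sum(pred == "No"  & Default_test$default == "No")
        FP <- sum(pred == "Yes" & Default_test$default == "No")
        FN <- sum(pred == "No"  & Default_test$default == "Yes")
        cat("cutoff:", cutoff,
            "error:", (FP + FN) / (TP + TN + FP + FN),
            "sensitivity:", TP / (TP + FN),
            "specificity:", TN / (TN + FP),
            "precision:", TP / (TP + FP), "\n")
      }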

  3. [5 pts] Load the library ROCR. Using the functions in that library, plot the ROC curve and calculate the AUC. Use the ROC curve to determine a cutoff value and comment on your reasoning.
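
    A minimal ROCR sketch, reusing the test-set probabilities p_hat from the previous part:

      library(ROCR)
      
      rocr_pred <- prediction(p_hat, Default_test$default)
      rocr_roc  <- performance(rocr_pred, measure = "tpr", x.measure = "fpr")
      plot(rocr_roc, colorize = TRUE)  # cutoff values shown along the curve
      
      performance(rocr_pred, measure = "auc")@y.values[[1]]  # AUC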

  4. [5 pts] Load the library glmnet. Using the cv.glmnet() function, do 20-fold cross-validation on the training dataset to determine the optimal penalty coefficient, \(\lambda\), in the logistic regression with ridge penalty. To choose the best penalty coefficient, use AUC as the cross-validation metric.
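
    A sketch of one way to set this up (alpha = 0 gives the ridge penalty; type.measure = "auc" makes AUC the CV metric):

      library(glmnet)
      
      x_train <- as.matrix(Default_train[, c("balance", "income")])
      y_train <- Default_train$default
      
      cv_ridge <- cv.glmnet(x_train, y_train, family = "binomial",
                            alpha = 0, nfolds = 20, type.measure = "auc")
      cv_ridge$lambda.min  # penalty value with the best CV AUC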

Question 3: K-Nearest Neighbors for Multi-class Classification (50 pts)

The MNIST dataset of handwritten digits is one of the most popular image datasets from the early days of machine learning, and many algorithms have pushed the accuracy on it to over 99%. The dataset is stored in an online repository in CSV format, https://pjreddie.com/media/files/mnist_train.csv. We will download the first 2500 observations of this dataset using the following code. The first column contains the digit labels; the remaining columns are the pixel values. After we download the dataset, we save it to our local disk so we do not have to re-download the data in the future.

  # inputs to download file
  fileLocation <- "https://pjreddie.com/media/files/mnist_train.csv"
  numRowsToDownload <- 2500
  localFileName <- paste0("mnist_first", numRowsToDownload, ".RData")
  
  # download the data and add column names
  # the CSV file has no header row, so read with header = FALSE
  mnist <- read.csv(fileLocation, header = FALSE, nrows = numRowsToDownload)
  numColsMnist <- dim(mnist)[2]
  colnames(mnist) <- c("Digit", paste0("Pixel", 1:(numColsMnist - 1)))
  
  # save file
  # in the future we can read in from the local copy instead of having to redownload
  save(mnist, file = localFileName)
  
  # you can load the data with the following code
  load(file = localFileName)
  1. [20 pts] The first task is to write the code to implement the K-Nearest Neighbors, or KNN, model from scratch. We will do this in steps:

    • Write a function called euclidean_distance that calculates the Euclidean distance between two vectors. There are two input arguments for this function: vector 1 (vec1), and vector 2 (vec2). The output for this function is a numeric, the Euclidean distance (euclDist).

    • Write a function called manhattan_distance that calculates the Manhattan distance between two vectors. There are two input arguments for this function: vector 1 (vec1), and vector 2 (vec2). The output for this function is a numeric, the Manhattan distance (manhDist).

    • Write a function called euclidean_distance_all that calculates the Euclidean distance between a vector and all the row vectors in an input data matrix. There are two input arguments for this function: a vector (vec1) and an input data matrix (mat1_X). The output for this function is a vector (output_euclDistVec) which is of the same length as the number of rows in mat1_X. This function must use the function euclidean_distance you previously wrote.

    • Write a function called manhattan_distance_all that calculates the Manhattan distance between a vector and all the row vectors in an input data matrix. There are two input arguments for this function: a vector (vec1) and an input data matrix (mat1_X). The output for this function is a vector (output_manhattanDistVec) which is of the same length as the number of rows in mat1_X. This function must use the function manhattan_distance you previously wrote.

    • Write a function called my_KNN that compares a vector to a matrix and finds its K-nearest neighbors. There are five input arguments for this function: vector 1 (vec1), the input data matrix (mat1_X), the class labels corresponding to each row of the matrix (mat1_Y), the number of nearest neighbors you are interested in finding (K), and a Boolean argument specifying if we are using the Euclidean distance (euclDistUsed). The argument K should be a positive integer. If the argument euclDistUsed = TRUE, then use the Euclidean distance. Otherwise, use the Manhattan distance. The output of this function is a list of length 2 (output_knnMajorityVote). The first element in the output list should be a vector of length K containing the class labels of the closest neighbors. The second element in the output list should be the majority vote of the K class labels in the first element of the list. The function must use the functions euclidean_distance and manhattan_distance you previously wrote.

    Apply this function to predict the label of the \(123^{\text{rd}}\) observation using the first \(100\) observations as your input training data matrix. Use \(K = 10\). What is the predicted label when you use Euclidean distance? What is the predicted label when you use Manhattan distance? Are these predictions correct?
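
    A sketch of how these functions might look (the tie-breaking rule in the majority vote is our own choice, since the question does not specify one):

      # Euclidean distance between two vectors
      euclidean_distance <- function(vec1, vec2) {
        euclDist <- sqrt(sum((vec1 - vec2)^2))
        return(euclDist)
      }
      
      # Manhattan distance between two vectors
      manhattan_distance <- function(vec1, vec2) {
        manhDist <- sum(abs(vec1 - vec2))
        return(manhDist)
      }
      
      # Euclidean distance from vec1 to every row of mat1_X
      euclidean_distance_all <- function(vec1, mat1_X) {
        output_euclDistVec <- apply(mat1_X, 1, euclidean_distance, vec2 = vec1)
        return(output_euclDistVec)
      }
      
      # Manhattan distance from vec1 to every row of mat1_X
      manhattan_distance_all <- function(vec1, mat1_X) {
        output_manhattanDistVec <- apply(mat1_X, 1, manhattan_distance, vec2 = vec1)
        return(output_manhattanDistVec)
      }
      
      # find the K nearest neighbors of vec1 and take a majority vote
      my_KNN <- function(vec1, mat1_X, mat1_Y, K, euclDistUsed = TRUE) {
        distVec <- if (euclDistUsed) euclidean_distance_all(vec1, mat1_X)
                   else manhattan_distance_all(vec1, mat1_X)
        nn_labels <- mat1_Y[order(distVec)[1:K]]
        # ties are broken by which.max, which keeps the first mode
        vote <- names(which.max(table(nn_labels)))
        output_knnMajorityVote <- list(nn_labels, vote)
        return(output_knnMajorityVote)
      }
      
      # predict the 123rd observation from the first 100 observations, K = 10
      my_KNN(as.numeric(mnist[123, -1]), as.matrix(mnist[1:100, -1]),
             mnist[1:100, 1], K = 10, euclDistUsed = TRUE)
      my_KNN(as.numeric(mnist[123, -1]), as.matrix(mnist[1:100, -1]),
             mnist[1:100, 1], K = 10, euclDistUsed = FALSE)
      mnist[123, 1]  # true label, for checking the predictions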

  2. [20 pts] Set the seed to 7 at the beginning of the chunk. Let’s now use 20-fold cross-validation to select the best \(K\). Load the library caret; we will use the trainControl and train functions from this library to fit a KNN classification model. The \(K\) values we will consider are \(1\), \(5\), \(10\), \(20\), \(50\), and \(100\). Be careful not to confuse the number of folds with the number of nearest neighbors when using these functions. Use the first \(1250\) observations as the training data to fit each model and compare the results. What is the best \(K\) according to cross-validation classification accuracy? Once you have chosen \(K\), fit a final KNN model on your entire training dataset with that value. Use that model to predict the classes of the last \(1250\) observations, which form our test dataset. Report the prediction confusion matrix on the test dataset for your final KNN model. Calculate the test error and the sensitivity of each class.
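
    One possible setup with caret (here the "k" inside tuneGrid is the number of neighbors, while the number of folds lives in trainControl):

      set.seed(7)
      library(caret)
      
      mnist_train <- mnist[1:1250, ]
      mnist_test  <- mnist[1251:2500, ]
      mnist_train$Digit <- factor(mnist_train$Digit)
      
      # 20-fold cross-validation over the candidate neighbor counts
      ctrl <- trainControl(method = "cv", number = 20)
      knn_cv <- train(Digit ~ ., data = mnist_train, method = "knn",
                      trControl = ctrl,
                      tuneGrid = data.frame(k = c(1, 5, 10, 20, 50, 100)))
      knn_cv$bestTune  # best K by CV accuracy; train refits it on all 1250 rows
      
      # predict the test set and summarize
      # (assumes every digit appears among the predictions, so the table is square)
      knn_pred  <- predict(knn_cv, newdata = mnist_test)
      confusion <- table(Predicted = knn_pred, Actual = mnist_test$Digit)
      confusion
      1 - sum(diag(confusion)) / sum(confusion)  # test error
      diag(confusion) / colSums(confusion)       # per-class sensitivity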

  3. [10 pts] Set the seed to 7 at the beginning of the chunk. Now let’s try multi-class (i.e., multinomial) logistic regression to fit the data. Use the first 1250 observations as the training data and the rest as the testing data. Load the library glmnet. We will use a multi-class logistic regression model with a Lasso penalty. First, we seek a near-optimal value for the \(\lambda\) penalty parameter: use the cv.glmnet function with \(20\) folds on the training dataset to find \(\lambda_{1se}\). Once you have identified \(\lambda_{1se}\), use the glmnet() function with that penalty value to fit a multi-class logistic regression model on the entire training dataset. Ensure you set the argument family = "multinomial" within the functions as appropriate. Using that model, predict the class labels for the testing data. Report the testing data prediction confusion matrix. What is the test error?
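
    A sketch under the stated settings (alpha = 1 gives the Lasso penalty):

      set.seed(7)
      library(glmnet)
      
      x_train <- as.matrix(mnist[1:1250, -1])
      y_train <- factor(mnist[1:1250, 1])
      x_test  <- as.matrix(mnist[1251:2500, -1])
      y_test  <- factor(mnist[1251:2500, 1])
      
      # 20-fold CV for the multinomial Lasso, then extract lambda_1se
      cv_lasso <- cv.glmnet(x_train, y_train, family = "multinomial",
                            alpha = 1, nfolds = 20)
      lambda_1se <- cv_lasso$lambda.1se
      
      # refit on the full training data at lambda_1se and predict classes
      fit_lasso <- glmnet(x_train, y_train, family = "multinomial",
                          alpha = 1, lambda = lambda_1se)
      pred <- drop(predict(fit_lasso, newx = x_test, type = "class"))
      confusion <- table(Predicted = pred, Actual = y_test)
      confusion
      1 - sum(diag(confusion)) / sum(confusion)  # test error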