Instructions

Students are encouraged to work together on homework. However, sharing, copying, or providing any part of a homework solution or code is an infraction of the University’s rules on Academic Integrity. Any violation will be punished as severely as possible. Final submissions must be uploaded to compass2g. No email or hardcopy will be accepted. For late submission policy and grading rubrics, please refer to the course website.

Question 1 [35 Points] Local Linear Regression

We have implemented the Nadaraya-Watson kernel estimator in HW 6. In this question, we will investigate a local linear regression: \[ \widehat{f}\left(x\right)=\widehat{\beta}_{0}\left(x\right)+ \widehat{\beta}_{1}\left(x\right) x, \] where \(x\) is a testing point. The local coefficients \(\widehat{\beta}_{r}\left(x \right)\) for \(r=0, 1\) are obtained by minimizing the objective function \[ \underset{\beta_{0}(x), \, \beta_{1}(x)}{\operatorname{minimize}} \quad \sum_{i=1}^{n} K_{\lambda} \left(x, x_{i}\right) \Big[y_{i}-\beta_{0}(x) - \beta_1(x) x_{i} \Big]^{2}. \]

In this question, we will use the Gaussian kernel \(K(u) = \frac{1}{\sqrt{2 \pi}} e^{- \frac{u^2}{2}}\), so the kernel weights are \(K_{\lambda}(x, x_i) = K\left(\frac{x - x_i}{\lambda}\right)\) for a bandwidth \(\lambda\).

  1. [20 pts] Write a function myLocLinear(trainX, trainY, testX, lambda), where lambda is the bandwidth and testX is all testing samples. This function returns predictions on testX. The solution of \(\beta_{0}(x)\) and \(\beta_{1}(x)\) can be obtained by fitting a weighted linear regression. The formula is provided on Page 25 of our lecture note.

  2. [15 pts] Fit a local linear regression with our given training data. The testing data are generated using the code given below. Try a sequence of bandwidths \(\lambda = 0.05, 0.10, \ldots, 0.55, 0.60\) when calculating the kernel function.

  train = read.csv('hw7_Q1_train.csv')
  testX = 2 * pi * seq(0, 1, by = 0.01)
  testY = sin(testX)
  plot(train$x, train$y, pch = 19, cex = 0.3, xlab = "x", ylab = "y")
  lines(testX, testY, col = "darkorange", lwd=2.0)
  legend("topright", legend = c("Truth"), 
         col = c("darkorange"), lty = 1, cex = 2)

Solution

  # set.seed(1)
  # n = 500
  # trainX = runif(n) * 2 * pi
  # trainY = sin(trainX) + 0.2 * rnorm(n)
  # write.csv(data.frame(x = trainX, y = trainY), 'hw7_Q1_train.csv', row.names = FALSE)

  # Gaussian kernel; the 1/sqrt(2*pi) constant is dropped since it cancels
  # in the weighted least-squares solution
  GaussianKernel <- function(x, y, bw){
    exp(- ((x - y)/bw)^2 / 2)
  }
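
For reference, local linear regression at a point \(x\) is a weighted least-squares fit, so the local coefficients have the standard closed form (this should match the formula on Page 25 of the lecture note): \[ \big(\widehat{\beta}_{0}(x), \, \widehat{\beta}_{1}(x)\big)^{\top} = \left(\mathbf{X}^{\top} \mathbf{W}(x) \mathbf{X}\right)^{-1} \mathbf{X}^{\top} \mathbf{W}(x) \mathbf{y}, \] where \(\mathbf{X}\) is the \(n \times 2\) design matrix with rows \((1, x_i)\), \(\mathbf{y}\) is the vector of training responses, and \(\mathbf{W}(x) = \operatorname{diag}\big(K_{\lambda}(x, x_1), \ldots, K_{\lambda}(x, x_n)\big)\). The function below computes this quantity for each testing point.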

  myLocLinear <- function(trainX, trainY, testX, lambda){
    
    nTest = length(testX)
    pred  = rep(NA, nTest)

    # the design matrix
    X = cbind(1, trainX)
    
    for(i in 1:nTest){    
    
      # get kernel weights for this testing sample
      K = GaussianKernel(testX[i], trainX, lambda)
    
      # W X, where W = diag(K)
      WX = sweep(X, 1, STATS = K, FUN = "*")
      
      # beta = (X^T W X)^(-1) X^T W y
      beta = solve(t(X) %*% WX) %*% t(WX) %*% trainY
    
      # get prediction
      pred[i] = beta[1] + beta[2] * testX[i]
      
    }
    
    return(pred)
  }
  lambda_all = seq(0.05, 0.6, 0.05)
  n_lambda = length(lambda_all)
  test_mse = rep(0, n_lambda)
  pred_mat = matrix(NA, nrow = length(testX), ncol = n_lambda)
  
  for(i in 1:n_lambda){
    pred = myLocLinear(train$x, train$y, testX, lambda = lambda_all[i])
    test_mse[i] = mean((pred - testY)^2)
    pred_mat[, i] = pred
  }
  plot(lambda_all, test_mse, xlab = "lambda", ylab = "test MSE")

Hence, the minimum testing error is 0.0015754, and the best bandwidth \(\lambda\) is 0.25.
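
These values can be read off test_mse directly; a quick check (not part of the original code):

  # minimum testing MSE and the bandwidth attaining it
  min(test_mse)
  lambda_all[which.min(test_mse)]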

  par(mfrow = c(1, 3))
  plot.idx = c(1, 5, 10)
  
  for(i in plot.idx){
    plot(train$x, train$y, xlab = "x", ylab = "y", pch = 19, cex = 0.3,
         main = paste0("Lambda = ", lambda_all[i]))
    lines(testX, sin(testX), col = "darkorange", lwd=2.0)
    lines(testX, pred_mat[, i], col = "deepskyblue", lwd=2.0)
    legend("topright", legend = c("Truth", "Fitted curve"), 
           col = c("darkorange", "deepskyblue"), lty = 1)
  }

Comment: The curve in the left panel (\(\lambda = 0.05\)) is under-smoothed, while the curve in the right panel (\(\lambda = 0.5\)) is over-smoothed; the middle panel (\(\lambda = 0.25\)) corresponds to the best bandwidth.

Question 2 [35 Points] Linear Discriminant Analysis

For both Questions 2 and 3, you need to write your own code. We will use the handwritten digit recognition data from the ElemStatLearn package. We only consider the train-test split, with the pre-defined zip.train and zip.test. Simply use zip.train as the training data and zip.test as the testing data for all evaluations and tuning. No cross-validation is needed in the training process.

  library(ElemStatLearn)

  # load training and testing data
  dim(zip.train)
## [1] 7291  257
  dim(zip.test)
## [1] 2007  257
  # number of each digit
  table(zip.train[, 1])
## 
##    0    1    2    3    4    5    6    7    8    9 
## 1194 1005  731  658  652  556  664  645  542  644
  1. [10 pts] Estimate the mean and covariance matrix of each class, as well as the pooled covariance matrix. Basic built-in R functions such as cov are allowed. Do NOT print your results.
  2. [15 pts] Write your own linear discriminant analysis (LDA) code following our lecture note. To perform this, you should calculate \(\mu_k\), \(\pi_k\), and \(\Sigma\) from the data. You may consider saving the \(\mu_k\)'s and \(\pi_k\)'s as lists (with 10 elements each).

You are not required to write a single function to perform LDA, but you could consider defining a function such as myLDA(testX, mu_list, sigma_pool), where mu_list contains the estimated mean vector of each class and sigma_pool is the pooled covariance estimate. This function should return the predicted class based on comparing the discriminant functions \(\delta_k(x) = w_k^T x + b_k\) given on page 32 of the lecture note.

  3. [10 pts] Fit the LDA model on the training data and predict with the testing data.
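
For completeness, the discriminant functions referenced above take the standard LDA form (this should agree with page 32 of the lecture note): \[ \delta_{k}(x) = w_{k}^{\top} x + b_{k}, \qquad w_{k} = \widehat{\Sigma}^{-1} \widehat{\mu}_{k}, \qquad b_{k} = -\frac{1}{2} \widehat{\mu}_{k}^{\top} \widehat{\Sigma}^{-1} \widehat{\mu}_{k} + \log \widehat{\pi}_{k}, \] and the predicted class is the \(k\) maximizing \(\delta_{k}(x)\).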

Answer:

Part a):

  # define training X and Y
  trainX = zip.train[, -1]
  trainY = zip.train[, 1]
  N = nrow(zip.train)
  # class counts and prior probabilities
  count = table(zip.train[, 1])
  p = count / N
  
  # calculate all components
  mu_list = list()
  sigma_list = list()
  sigma = 0
  
  # estimate class means and covariance matrices, accumulating the pooled covariance
  for(k in 1:10){
    mu_list[[k]] = colMeans(trainX[trainY == k-1, ])
    sigma_list[[k]] = cov(trainX[trainY == k-1, ])
    sigma = sigma + sigma_list[[k]]*(count[k] - 1)
  }
  
  # the pooled covariance matrix
  sigma_pool = sigma / (N - 10)
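
The pooled estimate computed above is the usual weighted combination of the class covariance estimates, \[ \widehat{\Sigma} = \frac{1}{N - K} \sum_{k=1}^{K} \left(n_{k} - 1\right) \widehat{\Sigma}_{k}, \] with \(K = 10\) classes and \(n_k\) training observations in class \(k\).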

Part b):

  myLDA <- function(testX, mu_list, sigma_pool){

    sigma_inv = solve(sigma_pool)
    
    # discriminant function values, one column per class
    DFun = matrix(NA, nrow(testX), length(mu_list))
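
    # note: the class priors p were estimated in Part a) and are used here
    # from the global environment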
    
    for (k in 1:10){
      w = sigma_inv %*% mu_list[[k]]
      b = as.numeric(-0.5 * t(mu_list[[k]]) %*% sigma_inv %*% mu_list[[k]] + log(p[k]))
      
      DFun[, k] = testX %*% as.vector(w) + b
    }
    
    return(apply(DFun, MARGIN = 1, which.max) - 1)
  }

Part c):

  LDA_fit = myLDA(zip.test[, -1], mu_list, sigma_pool)

  confusion = table(LDA_fit, zip.test[, 1])
  knitr::kable(confusion)
Confusion matrix (rows: LDA_fit prediction, columns: true digit):

             0     1     2     3     4     5     6     7     8     9
    0      342     0     7     3     1     6     1     0     5     0
    1        0   251     2     0     4     0     0     1     0     0
    2        0     0   157     3     6     0     3     0     2     0
    3        4     2     4   142     0    16     0     2    11     0
    4        3     5    12     3   174     3     3     7     7     4
    5        1     0     2     9     0   125     3     0     4     0
    6        5     3     1     0     2     0   157     0     0     0
    7        0     0     1     1     2     0     0   129     0     5
    8        3     1    12     4     1     5     3     1   135     3
    9        1     2     0     1    10     5     0     7     2   165
  # the first 5 entries of w for digit 0 are
  sigma_inv = solve(sigma_pool)
  (sigma_inv %*% mu_list[[1]])[1:5]
## [1] -549.553021   68.575047  -39.112632   -3.095659   -9.656750
  # b for digit 0 is
  as.numeric(-0.5 * t(mu_list[[1]]) %*% sigma_inv %*% mu_list[[1]] + log(p[1]))
## [1] -1156.443

\(1 -\) Sensitivity (per-class error rate):

  errRate = (colSums(confusion) - diag(confusion)) / colSums(confusion)
  errRate = matrix(errRate, nrow = 1)

  colnames(errRate) = 0:9
  rownames(errRate) = "Error Rate"
  knitr::kable(errRate, digits = 3)
                  0      1      2      3      4      5      6      7      8      9
    Error Rate  0.047  0.049  0.207  0.145  0.130  0.219  0.076  0.122  0.187  0.068

The overall error rate is 0.115.
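
A quick check (not part of the original code) that reproduces this number:

  # overall misclassification rate on the testing data
  mean(LDA_fit != zip.test[, 1])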

Question 3 [30 Points] Regularized Quadratic Discriminant Analysis

QDA uses a quadratic discriminant function. However, QDA does not work directly in this example because we do not have enough samples to provide an invertible sample covariance matrix for each digit. An alternative idea to fix this issue is to consider a regularized QDA method, which uses \[\widehat \Sigma_k(\alpha) = \alpha \widehat \Sigma_k + (1-\alpha) \widehat \Sigma \] instead of \(\widehat \Sigma_k\). These regularized covariance matrices are then used in the decision rule given on page 36 of the lecture notes. Complete the following questions:

  1. [20 pts] Write your own function myRQDA(testX, mu_list, sigma_list, sigma_pool, alpha), where alpha is a scalar and testX is your testing covariate matrix. You may also need a new sigma_list containing all the \(\widehat \Sigma_k\). This function should return a vector of predicted digits.
  2. [10 pts] Perform regularized QDA with the following sequence of \(\alpha\) values. Plot the testing error (misclassification rate) against alpha. Report the minimum testing error and the corresponding \(\alpha\).
  alpha_all = seq(0, 0.3, by = 0.05)

Answer:
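
For reference, the regularized QDA decision rule implemented below uses the quadratic discriminant functions (this should match page 36 of the lecture notes) \[ \delta_{k}(x) = -\frac{1}{2} \log \left|\widehat{\Sigma}_{k}(\alpha)\right| - \frac{1}{2} \left(x - \widehat{\mu}_{k}\right)^{\top} \widehat{\Sigma}_{k}(\alpha)^{-1} \left(x - \widehat{\mu}_{k}\right) + \log \widehat{\pi}_{k}, \] and predicts the digit \(k\) that maximizes \(\delta_{k}(x)\).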

  myRQDA <- function(testX, mu_list, sigma_list, sigma_pool, alpha){
  
    pred_val = matrix(NA, nrow = nrow(testX), ncol = 10)
    
    for(k in 1:10){
      sigma_k = (1 - alpha) * sigma_pool + alpha * sigma_list[[k]]
      sigma_k_inv = solve(sigma_k)
      x_center = testX - matrix(mu_list[[k]], nrow = nrow(testX), ncol = ncol(testX), byrow = TRUE)
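      # determinant() returns the log-determinant (modulus) by default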
      b = -0.5 * determinant(sigma_k)$modulus +  log(p[k])
      pred_val[, k] = b - 0.5 * apply(x_center, 1, function(x) {t(x) %*% sigma_k_inv %*% x})
    }
    
    pred_digit = apply(pred_val, 1, which.max) - 1
    return(pred_digit)
  }
  n_alpha = length(alpha_all)
  test_err = rep(NA, length(alpha_all))
  
  for(i in 1:n_alpha){
    alpha = alpha_all[i]
    pred = myRQDA(zip.test[, -1], mu_list, sigma_list, sigma_pool, alpha)
    test_err[i] = mean(pred != zip.test[, 1])
  }
  
  plot(alpha_all, test_err,  type = "b")

The minimum testing error is 0.0931739, and the best \(\alpha\) is 0.05.