Instructions

Students are encouraged to work together on homework. However, sharing, copying, or providing any part of a homework solution or code is an infraction of the University’s rules on Academic Integrity. Any violation will be punished as severely as possible. Final submissions must be uploaded to compass2g. No email or hardcopy will be accepted. For late submission policy and grading rubrics, please refer to the course website.

Question 1 [35 Points] Local Linear Regression

We have implemented the Nadaraya-Watson kernel estimator in HW 6. In this question, we will investigate a local linear regression: \[ \widehat{f}\left(x\right)=\widehat{\beta}_{0}\left(x\right)+ \widehat{\beta}_{1}\left(x\right) x, \] where \(x\) is a testing point. The local coefficients \(\widehat{\beta}_{r}\left(x \right)\) for \(r=0, 1\) are obtained by minimizing the objective function \[ \underset{\beta_{0}(x), \, \beta_{1}(x)}{\operatorname{minimize}} \quad \sum_{i=1}^{n} K_{\lambda} \left(x, x_{i}\right) \Big[y_{i}-\beta_{0}(x) - \beta_1(x) x_{i} \Big]^{2}. \]

In this question, we will use the Gaussian kernel \(K(u) = \frac{1}{\sqrt{2 \pi}} e^{- \frac{u^2}{2}}\), so the kernel weights are \(K_{\lambda}(x, x_i) = K\left(\frac{x - x_i}{\lambda}\right)\) for a bandwidth \(\lambda\).

  1. [20 pts] Write a function myLocLinear(trainX, trainY, testX, lambda), where lambda is the bandwidth and testX is all testing samples. This function returns predictions on testX. The solution of \(\beta_{0}(x)\) and \(\beta_{1}(x)\) can be obtained by fitting a weighted linear regression. The formula is provided on Page 25 of our lecture note.

  2. [15 pts] Fit a local linear regression with our given training data. The testing data are generated using the code given below. Try a sequence of bandwidths \(\lambda = 0.05, 0.10, \ldots, 0.55, 0.60\) when calculating the kernel function.

  train = read.csv('hw7_Q1_train.csv')
  testX = 2 * pi * seq(0, 1, by = 0.01)
  testY = sin(testX)
  plot(train$x, train$y, pch = 19, cex = 0.3, xlab = "x", ylab = "y")
  lines(testX, testY, col = "darkorange", lwd=2.0)
  legend("topright", legend = c("Truth"), 
         col = c("darkorange"), lty = 1, cex = 2)

Solution

  # set.seed(1)
  # n = 500
  # trainX = runif(n) * 2 * pi
  # trainY = sin(trainX) + 0.2 * rnorm(n)
  # write.csv(data.frame(x = trainX, y = trainY), 'hw7_Q1_train.csv', row.names = FALSE)

  # Gaussian kernel; the 1/sqrt(2*pi) constant is dropped since it cancels
  # in the weighted least-squares solution
  GaussianKernel <- function(x, y, bw){
    exp(- ((x - y)/bw)^2 / 2)
  }
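
For reference, local linear regression at a point \(x\) is a weighted least-squares fit, so the local coefficients have the standard closed form (this should match the formula on Page 25 of the lecture note): \[ \big(\widehat{\beta}_{0}(x), \, \widehat{\beta}_{1}(x)\big)^{\top} = \left(\mathbf{X}^{\top} \mathbf{W}(x) \mathbf{X}\right)^{-1} \mathbf{X}^{\top} \mathbf{W}(x) \mathbf{y}, \] where \(\mathbf{X}\) is the \(n \times 2\) design matrix with rows \((1, x_i)\), \(\mathbf{y}\) is the vector of training responses, and \(\mathbf{W}(x) = \operatorname{diag}\big(K_{\lambda}(x, x_1), \ldots, K_{\lambda}(x, x_n)\big)\). The function below computes this quantity for each testing point.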

  myLocLinear <- function(trainX, trainY, testX, lambda){
    
    nTest = length(testX)
    pred  = rep(NA, nTest)

    # the design matrix
    X = cbind(1, trainX)
    
    for(i in 1:nTest){    
    
      # get kernel weights for this testing sample
      K = GaussianKernel(testX[i], trainX, lambda)
    
      # W X, where W = diag(K)
      WX = sweep(X, 1, STATS = K, FUN = "*")
      
      # beta = (X^T W X)^(-1) X^T W y
      beta = solve(t(X) %*% WX) %*% t(WX) %*% trainY
    
      # get prediction
      pred[i] = beta[1] + beta[2] * testX[i]
      
    }
    
    return(pred)
  }
  lambda_all = seq(0.05, 0.6, 0.05)
  n_lambda = length(lambda_all)
  test_mse = rep(0, n_lambda)
  pred_mat = matrix(NA, nrow = length(testX), ncol = n_lambda)
  
  for(i in 1:n_lambda){
    pred = myLocLinear(train$x, train$y, testX, lambda = lambda_all[i])
    test_mse[i] = mean((pred - testY)^2)
    pred_mat[, i] = pred
  }
  plot(lambda_all, test_mse, xlab = "lambda", ylab = "test MSE")

Hence, the minimum testing error is 0.0015754, and the best bandwidth \(\lambda\) is 0.25.
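
These values can be read off test_mse directly; a quick check (not part of the original code):

  # minimum testing MSE and the bandwidth attaining it
  min(test_mse)
  lambda_all[which.min(test_mse)]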

  par(mfrow = c(1, 3))
  plot.idx = c(1, 5, 10)
  
  for(i in plot.idx){
    plot(train$x, train$y, xlab = "x", ylab = "y", pch = 19, cex = 0.3,
         main = paste0("Lambda = ", lambda_all[i]))
    lines(testX, sin(testX), col = "darkorange", lwd=2.0)
    lines(testX, pred_mat[, i], col = "deepskyblue", lwd=2.0)
    legend("topright", legend = c("Truth", "Fitted curve"), 
           col = c("darkorange", "deepskyblue"), lty = 1)
  }

Comment: The curve in the left panel (\(\lambda = 0.05\)) is under-smoothed, while the curve in the right panel (\(\lambda = 0.5\)) is over-smoothed; the middle panel (\(\lambda = 0.25\)) corresponds to the best bandwidth.

Question 2 [35 Points] Linear Discriminant Analysis

For both Questions 2 and 3, you need to write your own code. We will use the handwritten digit recognition data from the ElemStatLearn package. We only consider the train-test split, with the pre-defined zip.train and zip.test. Simply use zip.train as the training data and zip.test as the testing data for all evaluations and tuning. No cross-validation is needed in the training process.

  library(ElemStatLearn)

  # load training and testing data
  dim(zip.train)
## [1] 7291  257
  dim(zip.test)
## [1] 2007  257
  # number of each digit
  table(zip.train[, 1])
## 
##    0    1    2    3    4    5    6    7    8    9 
## 1194 1005  731  658  652  556  664  645  542  644
  1. [10 pts] Estimate the mean and covariance matrix of each class, as well as the pooled covariance matrix. Basic built-in R functions such as cov are allowed. Do NOT print your results.
  2. [15 pts] Write your own linear discriminant analysis (LDA) code following our lecture note. To perform this, you should calculate \(\mu_k\), \(\pi_k\), and \(\Sigma\) from the data. You may consider saving the \(\mu_k\)'s and \(\pi_k\)'s as lists (with 10 elements each).

You are not required to write a single function to perform LDA, but you could consider defining a function such as myLDA(testX, mu_list, sigma_pool), where mu_list contains the estimated mean vector of each class and sigma_pool is the pooled covariance estimate. This function should return the predicted class based on comparing the discriminant functions \(\delta_k(x) = w_k^T x + b_k\) given on page 32 of the lecture note.

  3. [10 pts] Fit the LDA model on the training data and predict with the testing data.
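
For completeness, the discriminant functions referenced above take the standard LDA form (this should agree with page 32 of the lecture note): \[ \delta_{k}(x) = w_{k}^{\top} x + b_{k}, \qquad w_{k} = \widehat{\Sigma}^{-1} \widehat{\mu}_{k}, \qquad b_{k} = -\frac{1}{2} \widehat{\mu}_{k}^{\top} \widehat{\Sigma}^{-1} \widehat{\mu}_{k} + \log \widehat{\pi}_{k}, \] and the predicted class is the \(k\) maximizing \(\delta_{k}(x)\).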

Answer:

Part a):

  # define training X and Y
  trainX = zip.train[, -1]
  trainY = zip.train[, 1]
  N = nrow(zip.train)
  # class counts and prior probabilities
  count = table(zip.train[, 1])
  p = count / N
  
  # calculate all components
  mu_list = list()
  sigma_list = list()
  sigma = 0
  
  # estimate class means and covariance matrices, accumulating the pooled covariance
  for(k in 1:10){
    mu_list[[k]] = colMeans(trainX[trainY == k-1, ])
    sigma_list[[k]] = cov(trainX[trainY == k-1, ])
    sigma = sigma + sigma_list[[k]]*(count[k] - 1)
  }
  
  # the pooled covariance matrix
  sigma_pool = sigma / (N - 10)
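
The pooled estimate computed above is the usual weighted combination of the class covariance estimates, \[ \widehat{\Sigma} = \frac{1}{N - K} \sum_{k=1}^{K} \left(n_{k} - 1\right) \widehat{\Sigma}_{k}, \] with \(K = 10\) classes and \(n_k\) training observations in class \(k\).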

Part b):

  myLDA <- function(testX, mu_list, sigma_pool){

    sigma_inv = solve(sigma_pool)
    
    # discriminant function values, one column per class
    DFun = matrix(NA, nrow(testX), length(mu_list))
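
    # note: the class priors p were estimated in Part a) and are used here
    # from the global environment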
    
    for (k in 1:10){
      w = sigma_inv %*% mu_list[[k]]
      b = as.numeric(-0.5 * t(mu_list[[k]]) %*% sigma_inv %*% mu_list[[k]] + log(p[k]))
      
      DFun[, k] = testX %*% as.vector(w) + b
    }
    
    return(apply(DFun, MARGIN = 1, which.max) - 1)
  }

Part c):

  LDA_fit = myLDA(zip.test[, -1], mu_list, sigma_pool)

  confusion = table(LDA_fit, zip.test[, 1])
  knitr::kable(confusion)
Confusion matrix (rows: LDA_fit prediction, columns: true digit):

             0     1     2     3     4     5     6     7     8     9
    0      342     0     7     3     1     6     1     0     5     0
    1        0   251     2     0     4     0     0     1     0     0
    2        0     0   157     3     6     0     3     0     2     0
    3        4     2     4   142     0    16     0     2    11     0
    4        3     5    12     3   174     3     3     7     7     4
    5        1     0     2     9     0   125     3     0     4     0
    6        5     3     1     0     2     0   157     0     0     0
    7        0     0     1     1     2     0     0   129     0     5
    8        3     1    12     4     1     5     3     1   135     3
    9        1     2     0     1    10     5     0     7     2   165
  # the first 5 entries of w for digit 0 are
  sigma_inv = solve(sigma_pool)
  (sigma_inv %*% mu_list[[1]])[1:5]
## [1] -549.553021   68.575047  -39.112632   -3.095659   -9.656750
  # b for digit 0 is
  as.numeric(-0.5 * t(mu_list[[1]]) %*% sigma_inv %*% mu_list[[1]] + log(p[1]))
## [1] -1156.443

\(1 -\) Sensitivity (per-class error rate):

  errRate = (colSums(confusion) - diag(confusion)) / colSums(confusion)
  errRate = matrix(errRate, nrow = 1)

  colnames(errRate) = 0:9
  rownames(errRate) = "Error Rate"
  knitr::kable(errRate, digits = 3)
                  0      1      2      3      4      5      6      7      8      9
    Error Rate  0.047  0.049  0.207  0.145  0.130  0.219  0.076  0.122  0.187  0.068

The overall error rate is 0.115.
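
A quick check (not part of the original code) that reproduces this number:

  # overall misclassification rate on the testing data
  mean(LDA_fit != zip.test[, 1])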

Question 3 [30 Points] Regularized Quadratic Discriminant Analysis

QDA uses a quadratic discriminant function. However, QDA does not work directly in this example because we do not have enough samples to provide an invertible sample covariance matrix for each digit. An alternative idea to fix this issue is to consider a regularized QDA method, which uses \[\widehat \Sigma_k(\alpha) = \alpha \widehat \Sigma_k + (1-\alpha) \widehat \Sigma \] instead of \(\widehat \Sigma_k\). These regularized covariance matrices are then used in the decision rule given on page 36 of the lecture notes. Complete the following questions:

  1. [20 pts] Write your own function myRQDA(testX, mu_list, sigma_list, sigma_pool, alpha), where alpha is a scalar and testX is your testing covariate matrix. You may also need a new sigma_list containing all the \(\widehat \Sigma_k\). This function should return a vector of predicted digits.
  2. [10 pts] Perform regularized QDA with the following sequence of \(\alpha\) values. Plot the testing error (misclassification rate) against alpha. Report the minimum testing error and the corresponding \(\alpha\).
  alpha_all = seq(0, 0.3, by = 0.05)

Answer:
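
For reference, the regularized QDA decision rule implemented below uses the quadratic discriminant functions (this should match page 36 of the lecture notes) \[ \delta_{k}(x) = -\frac{1}{2} \log \left|\widehat{\Sigma}_{k}(\alpha)\right| - \frac{1}{2} \left(x - \widehat{\mu}_{k}\right)^{\top} \widehat{\Sigma}_{k}(\alpha)^{-1} \left(x - \widehat{\mu}_{k}\right) + \log \widehat{\pi}_{k}, \] and predicts the digit \(k\) that maximizes \(\delta_{k}(x)\).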

  myRQDA <- function(testX, mu_list, sigma_list, sigma_pool, alpha){
  
    pred_val = matrix(NA, nrow = nrow(testX), ncol = 10)
    
    for(k in 1:10){
      sigma_k = (1 - alpha) * sigma_pool + alpha * sigma_list[[k]]
      sigma_k_inv = solve(sigma_k)
      x_center = testX - matrix(mu_list[[k]], nrow = nrow(testX), ncol = ncol(testX), byrow = TRUE)
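      # determinant() returns the log-determinant (modulus) by default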
      b = -0.5 * determinant(sigma_k)$modulus +  log(p[k])
      pred_val[, k] = b - 0.5 * apply(x_center, 1, function(x) {t(x) %*% sigma_k_inv %*% x})
    }
    
    pred_digit = apply(pred_val, 1, which.max) - 1
    return(pred_digit)
  }
  n_alpha = length(alpha_all)
  test_err = rep(NA, length(alpha_all))
  
  for(i in 1:n_alpha){
    alpha = alpha_all[i]
    pred = myRQDA(zip.test[, -1], mu_list, sigma_list, sigma_pool, alpha)
    test_err[i] = mean(pred != zip.test[, 1])
  }
  
  plot(alpha_all, test_err,  type = "b")

The minimum testing error is 0.0931739, and the best \(\alpha\) is 0.05.