About HW2

For this HW, we mainly try to understand the KNN method in both classification and regression settings and apply it to several real-data examples. Tuning the model will help us understand the bias-variance trade-off. A slightly more challenging task is to code the KNN method yourself; for that question, you cannot use any additional package to assist the calculation.

There is an important package, ElemStatLearn, which is the package associated with the ESL textbook for this course. Unfortunately, the package has been archived on CRAN. You can install an earlier version of this package by using

require(devtools)
install_version("ElemStatLearn", version = "2015.6.26.2", repos = "http://cran.us.r-project.org")

And of course you will have to install the devtools package if you don’t already have it.

Question 1 [40 Points] KNN Classification (Diabetes)

Load the Pima Indians Diabetes Database (PimaIndiansDiabetes) from the mlbench package. If you don't already have this package installed, use the following code. It also randomly splits the data into training and testing sets. You should preserve this split in the analysis.

# install.packages("mlbench") # run this line if you don't have the package
library(mlbench)
data(PimaIndiansDiabetes)

set.seed(2)
trainid = sample(1:nrow(PimaIndiansDiabetes), nrow(PimaIndiansDiabetes)/2)
Diab.train = PimaIndiansDiabetes[trainid, ]
Diab.test = PimaIndiansDiabetes[-trainid, ]

Read the documentation of this dataset here and make sure that you understand that the goal is classification. KNN uses different prediction strategies for classification and regression.
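
Specifically, for classification the KNN prediction is the majority vote among the \(k\) nearest neighbors, while for regression it is their average: \[ \hat{y}(x) = \arg\max_{c} \sum_{i \in N_k(x)} \mathbf{1}\{y_i = c\} \quad \text{(classification)}, \qquad \hat{y}(x) = \frac{1}{k} \sum_{i \in N_k(x)} y_i \quad \text{(regression)}, \] where \(N_k(x)\) is the index set of the \(k\) training points closest to \(x\).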

Use a grid of \(k\) values (every integer) from 1 to 20.

  a. [10 pts] Fit the KNN model using Diab.train and calculate both training and testing errors. For the testing error, use Diab.test. Plot the two errors against the corresponding \(k\) values. Make sure that you differentiate them using different colors/shapes and add a proper legend.
# reuse the Diab.train / Diab.test split created above

kgrid = c(1:20)
trainerr = rep(NA, length(kgrid))
testerr = rep(NA, length(kgrid))

for (i in 1:length(kgrid)){
  # Evaluate training and testing error separately
  trainerr[i] = mean(FNN::knn(train = Diab.train[, -9], test = Diab.train[, -9],
                  cl = as.factor(Diab.train[, 9]), k = kgrid[i]) != Diab.train[, 9])
  
  testerr[i] = mean(FNN::knn(train = Diab.train[, -9], test = Diab.test[, -9],
                 cl = as.factor(Diab.train[, 9]), k = kgrid[i]) != Diab.test[, 9])
}

plot(kgrid, testerr, ylim = c(0, max(testerr)), pch = 19, col = "red",
     xlab = "k", ylab = "classification error")
points(kgrid, trainerr, pch = 19, col = "blue")
legend("bottomright", c("train error", "test error"), col = c("blue", "red"), pch = 19, cex = 2)

Note that we convert the labels \(y\) to factors using as.factor().

  b. [15 pts] Does the plot match (approximately) our intuition of the bias-variance trade-off in terms of having a U-shaped error? What is the optimal \(k\) value based on this result? For the optimal \(k\), what are the degrees of freedom of the model and its error?
optK = which.min(testerr)
dof = nrow(Diab.train) / optK
sprintf('Optimal k: %d,   test error: %f,   df: %f', optK, testerr[optK],
        dof)
## [1] "Optimal k: 11,   test error: 0.257812,   df: 34.909091"

Answer: Yes, the testing error is roughly U-shaped. The optimal \(k\) is 11 and the corresponding testing error is 0.2578125. The degrees of freedom are 34.9090909. (The optimal \(k\) may vary due to the randomness of the train/test split; any reasonable testing error is acceptable.) \[ df = n / k \] Here \(n = 384\) is the number of training samples. The testing samples are not involved in fitting the model, so they do not contribute to the degrees of freedom.

  c. [15 pts] Suppose we do not have access to the Diab.test data. Thus, we need to further split the training data into train and validation data to tune \(k\). For this question, use the caret package to complete the tuning. You are required to
    • Train the KNN model with cross-validation using the train() function in the caret package.
      • Specify the type of cross-validation using the trainControl() function. We need to use 3-fold cross-validation.
      • Specify a grid of tuning parameters. This can be done using expand.grid(k = c(1:20)).
    • Report the best parameter with its error. Compare it with your \(k\) in b).

For details, read the documentation here to learn how to use the trainControl() and train() functions. Some examples can also be found here.

You can use names(getModelInfo()) to list all model methods available in train(). Here we use method = 'knn'.
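
For example, an optional one-line check (assuming caret is installed) that "knn" appears in that list:

"knn" %in% names(caret::getModelInfo())   # should return TRUE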

library(caret)

model = train(x = Diab.train[, -9], y = as.factor(Diab.train[, 9]), method = "knn",
          trControl = trainControl(method = "cv", number = 3), 
          tuneGrid = expand.grid(k = c(1:20)))

# the optimal k and its cross-validation error
optK = model$bestTune$k
err = 1 - model$results$Accuracy[model$results$k == optK]
sprintf("Optimal k: %d,  cv error: %f", optK, err)
## [1] "Optimal k: 12,  cv error: 0.231771"

Answer The optimal \(k\) is 12 and its cross-validation error is 0.2317708. This is close to the \(k = 11\) selected in b), with a similar error.

Remark One can see the model accuracy for each tuning parameter by printing model or model$results. (This is not required.)

model
## k-Nearest Neighbors 
## 
## 384 samples
##   8 predictor
##   2 classes: 'neg', 'pos' 
## 
## No pre-processing
## Resampling: Cross-Validated (3 fold) 
## Summary of sample sizes: 256, 256, 256 
## Resampling results across tuning parameters:
## 
##   k   Accuracy   Kappa    
##    1  0.6901042  0.3272540
##    2  0.6484375  0.2123182
##    3  0.7161458  0.3692994
##    4  0.7343750  0.4028874
##    5  0.7395833  0.4139364
##    6  0.7395833  0.4147061
##    7  0.7395833  0.4088362
##    8  0.7447917  0.4105564
##    9  0.7473958  0.4132025
##   10  0.7369792  0.3883334
##   11  0.7473958  0.4014025
##   12  0.7682292  0.4553177
##   13  0.7447917  0.3947892
##   14  0.7552083  0.4190417
##   15  0.7395833  0.3775050
##   16  0.7473958  0.3949736
##   17  0.7526042  0.4052555
##   18  0.7291667  0.3538634
##   19  0.7395833  0.3764656
##   20  0.7369792  0.3663101
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 12.
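
Remark Although part c) assumes Diab.test is unavailable for tuning, one optional sanity check is to apply the tuned model to the held-out test data afterwards; predict() on a caret train object returns class labels. A sketch (not required):

# optional: test error of the caret-tuned model on the held-out data
pred = predict(model, newdata = Diab.test[, -9])
mean(pred != Diab.test[, 9])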

Question 2 [40 Points] Write your own KNN for regression

  a. [10 pts] Generate the covariates \(X_1, X_2, X_3, X_4, X_5\) for \(n = 1000\) observations, with \(p = 5\), from independent standard normal distributions. Then, generate \(Y\) from \[ Y = X_1 + 0.5 \times X_2 - X_3 + \epsilon, \] with i.i.d. standard normal error \(\epsilon\).
# start with random seed 1
set.seed(1)
N = 1000; P = 5
X = matrix(rnorm(N*P), N, P)
Y = X[,1] + 0.5*X[,2] - X[,3] + rnorm(N)

# use the first 500 observations for training and the rest for testing
ntrain = 500
xtrain = X[1:ntrain, ]
xtest  = X[(ntrain+1):N, ]
ytrain = Y[1:ntrain]
ytest  = Y[(ntrain+1):N]

# mean squared error on the testing data, using FNN::knn.reg with k = 5
yPred = FNN::knn.reg(train = xtrain, test = xtest, y = ytrain, k = 5)
print(mean((yPred$pred - ytest)^2))
## [1] 1.393227

Answer The mean squared error is 1.393; for reference, the irreducible error \(Var(\epsilon)\) is 1.

  b. [30 pts] For this question, you cannot use (load) any additional R package. Write your own function myknn(xtrain, ytrain, xtest, k) that fits a KNN model and predicts on multiple target points xtest. The function should return a variable ytest.
    • Here xtrain is the training covariate matrix, ytrain is the training data outcome, and k is the number of nearest neighbors. ytest is the prediction on xtest.
    • Use the Euclidean distance to evaluate the closeness between two points.
    • Test your code by reporting the mean squared error on the testing data.
myknn <- function(xtrain, ytrain, xtest, k){
  if (is.vector(xtest)){
    if (length(xtest) != ncol(xtrain)) stop("dimensions do not match")
    
    # squared Euclidean distance from xtest to every row of xtrain;
    # the square root is omitted since it does not change the neighbor ranking
    d = rowSums(sweep(xtrain, MARGIN = 2, xtest, FUN = "-")^2)
    return( mean(ytrain[rank(d, ties.method = "random") <= k]) )
  }
  
  if (is.matrix(xtest)){
    if (ncol(xtest) != ncol(xtrain)) stop("dimensions do not match")
    
    ytest = rep(NA, nrow(xtest))
    for (i in 1:nrow(xtest)){
      # sweep() subtracts xtest[i, ] from every row of xtrain;
      # rowSums() of the squared differences gives the squared Euclidean distance
      d = rowSums(sweep(xtrain, MARGIN = 2, xtest[i, ], FUN = "-")^2)
      # average the outcomes of the k nearest neighbors (ties broken at random)
      ytest[i] = mean(ytrain[rank(d, ties.method = "random") <= k])
    }
    return(ytest)
  }
  stop("xtest needs to be a vector or a matrix")
}
mean((myknn(xtrain, ytrain, xtest, k = 5) - ytest)^2)
## [1] 1.393227

Answer The MSE of my knn function matches the FNN::knn.reg result in a), as expected, since both select the same nearest neighbors under the Euclidean distance.
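
Remark The per-point loop in myknn() can also be avoided by computing all pairwise squared Euclidean distances at once, using the identity \(\|a - b\|^2 = \|a\|^2 + \|b\|^2 - 2 a^\top b\). A vectorized sketch, still in base R only:

k = 5
# matrix of squared distances: D2[i, j] = sum((xtest[i, ] - xtrain[j, ])^2)
D2 = outer(rowSums(xtest^2), rowSums(xtrain^2), "+") - 2 * xtest %*% t(xtrain)
# average the outcomes of the k nearest training points for each test point
yhat = apply(D2, 1, function(d) mean(ytrain[order(d)[1:k]]))
mean((yhat - ytest)^2)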

Question 3 [30 Points] Curse of Dimensionality

Let’s consider a high-dimensional setting. Keep the model the same as in Question 2. In this question, you can use a KNN function from packages. We keep the \(Y\) from Question 2 but add 95 more noisy variables to the covariates.

We consider two cases that both generate an additional set of 95 covariates:

    • Case 1: generate the 95 additional covariates independently from the standard normal distribution.
    • Case 2: generate the 95 additional covariates as random linear combinations of the original five covariates.

Make sure that you set a seed when generating these covariates. Fit KNN in both settings (with the total of 100 covariates) and select the best \(k\) value. Answer the following questions:

  1. [10 pts] For the first setting, what is the best \(k\) and the best mean squared error for prediction?
  2. [10 pts] For the second setting, what is the best \(k\) and the best mean squared error for prediction?
  3. [10 pts] In which setting \(k\)NN performs better? Why?
P = 100
set.seed(2)

allk <- c(1:40)
knnerrors <- rep(NA, length(allk))

X1 = cbind(X, matrix(rnorm(N*(P-5)), N, (P-5)))      # case 1: independent noise covariates
X2 = cbind(X, X %*% matrix(runif(5*(P-5)), 5, P-5))  # case 2: linear combinations of X

# case 1
xtrain = X1[1:ntrain, ]
xtest = X1[(ntrain+1):N, ]

for (i in 1:length(knnerrors))
  knnerrors[i] = mean((FNN::knn.reg(xtrain, xtest, y = ytrain, k = allk[i])$pred - ytest)^2)

mse1 <- min(knnerrors)
k1 <- allk[which.min(knnerrors)]
sprintf('a): Optimal k: %d,   error: %f', k1, mse1)
## [1] "a): Optimal k: 21,   error: 2.669830"
# case 2
xtrain = X2[1:ntrain, ]
xtest = X2[(ntrain+1):N, ]

for (i in 1:length(knnerrors))
  knnerrors[i] = mean((FNN::knn.reg(xtrain, xtest, y = ytrain, k = allk[i])$pred - ytest)^2)

mse2 <- min(knnerrors)
k2 <- allk[which.min(knnerrors)]
sprintf('b): Optimal k: %d,   error: %f', k2, mse2)
## [1] "b): Optimal k: 6,   error: 1.436950"

Answer

The best \(k\) and MSE for the two settings:

  1. For the first setting, the optimal \(k\) is 21, with mean squared error 2.6698.
  2. For the second setting, the optimal \(k\) is 6, with mean squared error 1.4370.
  3. According to the best MSE, the second setting performs better. In setting 1, the additional variables add pure noise to the distance calculation, which degrades the neighbor selection. In setting 2, the additional variables are generated from the first five variables, so the distances are still driven by those five. In other words, the high-dimensional data retain a low-dimensional structure.
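
A quick numerical illustration of why setting 1 suffers: only 5 of the 100 coordinates carry signal, so for a typical pair of points the 95 noise coordinates contribute about 95% of the squared Euclidean distance, and neighbors are selected essentially by noise. A sketch:

# fraction of the squared distance between two points in setting 1
# that comes from the 5 signal covariates (about 5/100 on average)
v = (X1[1, ] - X1[2, ])^2
sum(v[1:5]) / sum(v)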