For this HW, we mainly try to understand the KNN method in both classification and regression settings and apply it to several real data examples. Tuning the model will help us understand the bias-variance trade-off. A slightly more challenging task is to code the KNN method yourself. For that question, you cannot use any additional package to assist the calculation.
There is an important package, ElemStatLearn, which is the package associated with the ESL textbook for this course. Unfortunately, the package is currently discontinued on CRAN. You can install an earlier version of this package by using
require(devtools)
install_version("ElemStatLearn", version = "2015.6.26.2", repos = "http://cran.us.r-project.org")
And of course you will have to install the devtools package if you don’t already have it.
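For example, mirroring the install pattern used for mlbench below:
# install.packages("devtools") # run this line if you don't have the package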
Load the Pima Indians Diabetes Database (PimaIndiansDiabetes) from the mlbench package. If you don’t already have this package installed, use the following code. The code also randomly splits the data into training and testing sets. You should preserve this split in the analysis.
# install.packages("mlbench") # run this line if you don't have the package
library(mlbench)
data(PimaIndiansDiabetes)
set.seed(2)
trainid = sample(1:nrow(PimaIndiansDiabetes), nrow(PimaIndiansDiabetes)/2)
Diab.train = PimaIndiansDiabetes[trainid, ]
Diab.test = PimaIndiansDiabetes[-trainid, ]
Read the documentation of this dataset here and make sure that you understand the goal of this classification problem. KNN algorithms use different prediction strategies for classification and regression.
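As a minimal sketch of the two prediction rules (not part of the assignment): classification takes a majority vote among the \(k\) nearest neighbors' labels, while regression averages their outcomes.
# toy example: outcomes of the k = 3 nearest neighbors
neighbor_labels = factor(c("pos", "neg", "pos"))
names(which.max(table(neighbor_labels))) # classification: majority vote -> "pos"
neighbor_y = c(1.2, 0.8, 1.0)
mean(neighbor_y) # regression: average of neighbor outcomes -> 1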
Use a grid of \(k\) values (every integer) from 1 to 20. Fit KNN on Diab.train and calculate both training and testing errors. For the testing error, use Diab.test. Plot the two errors against the corresponding \(k\) values. Make sure that you differentiate them using different colors/shapes and add proper legends.
# reuse the Diab.train / Diab.test split created above, so the split is preserved
kgrid = c(1:20)
trainerr = rep(NA, length(kgrid))
testerr = rep(NA, length(kgrid))
for (i in 1:length(kgrid)){
# Evaluate training and testing error separately
trainerr[i] = mean(FNN::knn(train = Diab.train[, -9], test = Diab.train[, -9],
cl = as.factor(Diab.train[, 9]), k = kgrid[i]) != Diab.train[, 9])
testerr[i] = mean(FNN::knn(train = Diab.train[, -9], test = Diab.test[, -9],
cl = as.factor(Diab.train[, 9]), k = kgrid[i]) != Diab.test[, 9])
}
plot(kgrid, testerr, ylim = c(0, max(testerr)), pch = 19, col = "red")
points(kgrid, trainerr, pch = 19, col = "blue")
legend("bottomright", c("train error", "test error"), col = c("blue", "red"), pch = 19, cex = 2)
Note that we convert the labels \(y\) to factors using as.factor().
Does the testing error exhibit a U-shape? At the optimal \(k\), what are the degrees of freedom of the model and its error?
optK = which.min(testerr)
dof = nrow(Diab.train) / optK
sprintf('Optimal k: %d, test error: %f, df: %f', optK, testerr[optK],
dof)
## [1] "Optimal k: 11, test error: 0.257812, df: 34.909091"
Answer: Yes, the testing error is U-shaped. The optimal \(k\) is 11 and the corresponding testing error is 0.2578125. The degrees of freedom are 34.9090909. (The optimal \(k\) may vary due to randomness in the split and tie-breaking; any reasonable testing error is acceptable.) \[ df = n / k \] Here \(n = 384\) is the number of training samples. The testing samples are not involved in training the model, so they do not contribute to the degrees of freedom.
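For reference, the same formula can be evaluated over the whole grid (a quick check using the objects defined above):
# df = n / k for every k in the grid; n = 384 training samples here
data.frame(k = kgrid, df = round(nrow(Diab.train) / kgrid, 2))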
In practice, we cannot tune \(k\) using the Diab.test data. Thus, we need to further split the training data into train and validation data to tune k. For this question, use the caret package to complete the tuning. You are required to:
- Use the train() function in the caret package.
- Specify the resampling scheme with the trainControl() function. We need to use 3-fold cross-validation.
- Set the tuning grid to expand.grid(k = c(1:20)).
- Report the optimal k in b).
For details, read the documentation here to learn how to use the trainControl() and train() functions. Some examples can also be found here.
You can use names(getModelInfo()) to list all the model methods available in train. Here we use method = 'knn'.
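As an optional check, getModelInfo() can also confirm that k is the only tuning parameter of this method:
library(caret)
# show the tunable parameter(s) of the 'knn' method
getModelInfo("knn", regex = FALSE)$knn$parameters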
library(caret)
model = train(x = Diab.train[, -9], y = as.factor(Diab.train[, 9]), method = "knn",
trControl = trainControl(method = "cv", number = 3),
tuneGrid = expand.grid(k = c(1:20)))
# the optimal k and its cross-validation error
optK = model$bestTune$k
err = 1 - max(model$results$Accuracy) # accuracy of the selected k
sprintf("Optimal k: %d, testing err: %f", optK, err)
## [1] "Optimal k: 12, testing err: 0.231771"
Answer The optimal k is 12 and its cross-validation error is 0.2317708.
Remark One can see the model accuracy for each tuning parameter by typing model or model$results. (This is not required.)
model
## k-Nearest Neighbors
##
## 384 samples
## 8 predictor
## 2 classes: 'neg', 'pos'
##
## No pre-processing
## Resampling: Cross-Validated (3 fold)
## Summary of sample sizes: 256, 256, 256
## Resampling results across tuning parameters:
##
## k Accuracy Kappa
## 1 0.6901042 0.3272540
## 2 0.6484375 0.2123182
## 3 0.7161458 0.3692994
## 4 0.7343750 0.4028874
## 5 0.7395833 0.4139364
## 6 0.7395833 0.4147061
## 7 0.7395833 0.4088362
## 8 0.7447917 0.4105564
## 9 0.7473958 0.4132025
## 10 0.7369792 0.3883334
## 11 0.7473958 0.4014025
## 12 0.7682292 0.4553177
## 13 0.7447917 0.3947892
## 14 0.7552083 0.4190417
## 15 0.7395833 0.3775050
## 16 0.7473958 0.3949736
## 17 0.7526042 0.4052555
## 18 0.7291667 0.3538634
## 19 0.7395833 0.3764656
## 20 0.7369792 0.3663101
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 12.
Generate the data below and fit a KNN regression with k = 5. Use the first 500 observations as the training data and the rest as testing data. Predict the \(Y\) values using a KNN function with k = 5. The mean squared error is \[\frac{1}{N}\sum_i (y_i - \widehat y_i)^2.\] This question also helps you validate your own function in b); a) and b) are expected to have similar (possibly not identical) results.
# start with random seed 1
set.seed(1)
N = 1000; P = 5
X = matrix(rnorm(N*P), N, P)
Y = X[,1] + 0.5*X[,2] - X[,3] + rnorm(N)
ntrain = 500
xtrain = X[1:ntrain, ]
xtest = X[(ntrain+1):N, ]
ytrain = Y[1:ntrain]
ytest = Y[(ntrain+1):N]
# mean squared error of the package KNN predictions
yPred = FNN::knn.reg(train = xtrain, test = xtest, y = ytrain, k = 5)
print(mean((yPred$pred - ytest)^2))
## [1] 1.393227
Answer The mean squared error is 1.393.
For this question, you cannot use a KNN function from any R package. Write your own function myknn(xtrain, ytrain, xtest, k) that fits a KNN model and predicts multiple target points xtest. The function should return a variable ytest. Here, xtrain is the training dataset covariate values, ytrain is the training data outcome, and k is the number of nearest neighbors; ytest is the prediction on xtest.
myknn <- function(xtest, xtrain, ytrain, k){
  if (is.vector(xtest)){
    # a single target point: its length must match the number of covariates
    if (length(xtest) != ncol(xtrain)) stop("dimensions do not match")
    # Manhattan (L1) distance from every training point to xtest
    d = rowSums(abs(sweep(xtrain, MARGIN = 2, xtest, FUN = "-")))
    # average the outcomes of the k nearest neighbors (ties broken at random)
    return( mean(ytrain[rank(d, ties.method = "random") <= k]) )
  }
  if (is.matrix(xtest)){
    if (ncol(xtest) != ncol(xtrain)) stop("dimensions do not match")
    ytest = rep(NA, nrow(xtest))
    for (i in 1:nrow(xtest)){
      # sweep() subtracts xtest[i, ] from each row of xtrain;
      # rowSums() of the absolute differences gives the Manhattan (L1) distance
      d = rowSums(abs(sweep(xtrain, MARGIN = 2, xtest[i, ], FUN = "-")))
      ytest[i] = mean(ytrain[rank(d, ties.method = "random") <= k])
    }
    return(ytest)
  }
  stop("xtest needs to be a vector or a matrix")
}
mean((myknn(xtest, xtrain, ytrain, k = 5) - ytest)^2)
## [1] 1.429269
Answer The MSE of my own KNN function is 1.429, close to (but not identical to) the result in a).
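The small gap from a) is expected: FNN::knn.reg uses the Euclidean distance, while myknn above uses the Manhattan distance (plus random tie-breaking). As a sketch, a hypothetical L2 helper that could replace the distance line inside myknn would look like:
# hypothetical helper: Euclidean (L2) distances from all rows of xtrain to a point x0
l2dist <- function(xtrain, x0) sqrt(rowSums(sweep(xtrain, MARGIN = 2, x0, FUN = "-")^2))
Substituting d = l2dist(xtrain, xtest[i, ]) should track FNN::knn.reg more closely, up to random tie-breaking.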
Let’s consider a high-dimensional setting. Keep the model the same as in question 3. In this question, you can use a KNN function from packages. We keep your \(Y\) from question 3 but add 95 more noisy variables to the covariates.
We consider two cases that both generate an additional set of 95 covariates (see the code below):
- Case 1: the 95 additional covariates are i.i.d. standard normal noise, independent of everything else.
- Case 2: the 95 additional covariates are random linear combinations of the original five covariates.
Make sure that you set a seed when generating these covariates. Fit KNN in both settings (with a total of 100 covariates) and select the best \(k\) value. Answer the following questions.
P = 100
set.seed(2)
allk <- c(1:40)
knnerrors <- rep(NA, length(allk))
X1 = cbind(X, matrix(rnorm(N*(P-5)), N, (P-5))) # case 1: independent noise covariates
X2 = cbind(X, X %*% matrix(runif(5*(P-5)), 5, P-5)) # case 2: linear combinations of X
# case 1
xtrain = X1[1:ntrain, ]
xtest = X1[(ntrain+1):N, ]
for (i in 1:length(knnerrors))
knnerrors[i] = mean((FNN::knn.reg(xtrain, xtest, y = ytrain, k = allk[i])$pred - ytest)^2)
mse1 <- min(knnerrors)
k1 <- allk[which.min(knnerrors)]
sprintf('a): Optimal k: %d, error: %f', k1, mse1)
## [1] "a): Optimal k: 21, error: 2.669830"
# case 2
xtrain = X2[1:ntrain, ]
xtest = X2[(ntrain+1):N, ]
for (i in 1:length(knnerrors))
knnerrors[i] = mean((FNN::knn.reg(xtrain, xtest, y = ytrain, k = allk[i])$pred - ytest)^2)
mse2 <- min(knnerrors)
k2 <- allk[which.min(knnerrors)]
sprintf('b): Optimal k: %d, error: %f', k2, mse2)
## [1] "b): Optimal k: 6, error: 1.436950"
Answer
The best \(k\) and MSE for the two settings: Case 1 (independent noise) selects k = 21 with MSE 2.6698, while Case 2 (linear combinations) selects k = 6 with MSE 1.4370. Independent noise dimensions hurt KNN far more, because they dominate the distance computation, whereas covariates built from the original signal preserve the informative low-dimensional structure.