\(K\) nearest neighbor (KNN) is a simple nonparametric method that can be used for both regression and classification problems. However, the idea is quite different from the models we introduced before. In a linear model, we have a set of parameters \(\boldsymbol \beta\), and the estimated function value at any target point \(x_0\) is \(x_0^\text{T}\boldsymbol \beta\). In KNN, we do not specify such parameters. Instead, we directly estimate \(f(x_0)\). Models of this kind are traditionally called nonparametric models in statistics. They usually perform a local averaging to estimate \(f(x_0)\) using observations close to \(x_0\).
Suppose we collect a set of observations \(\{x_i, y_i\}_{i=1}^n\). The prediction at a new target point \(x_0\) is
\[\widehat y = \frac{1}{k} \sum_{x_i \in N_k(x_0)} y_i,\] where \(N_k(x_0)\) denotes the set of \(k\) training samples that are closest to \(x_0\). By default, closeness is defined using a distance measure such as the Euclidean distance. Here is a demonstration of the fitted regression function in a one-dimensional case, where the true underlying function is \(2\sin(x)\).
# generate training data with 2*sin(x) and random Gaussian errors
set.seed(1)
x <- runif(15, 0, 2*pi)
y <- 2*sin(x) + rnorm(length(x))
# generate testing data points where we evaluate the prediction function
test.x = seq(0, 1, 0.001)*2*pi
# "1-nearest neighbor" regression using kknn package
library(kknn)
knn.fit = kknn(y ~ ., train = data.frame(x = x, y = y),
test = data.frame(x = test.x),
k = 1, kernel = "rectangular")
test.pred = knn.fit$fitted.values
# plot the data
par(mar=rep(2,4))
plot(x, y, xlim = c(0, 2*pi), pch = "o", cex = 2,
xlab = "", ylab = "", cex.lab = 1.5)
title(main="1-Nearest Neighbor Regression", cex.main = 1.5)
# plot the true regression line
lines(test.x, 2*sin(test.x), col = "deepskyblue", lwd = 3)
# plot the fitted line
lines(test.x, test.pred, type = "s", col = "darkorange", lwd = 3)
legend("topright", c("Fitted line", "True function"),
col = c("darkorange", "deepskyblue"), lty = 1, cex = 1.5)
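To connect the code with the averaging formula above, here is a from-scratch sketch of the same fit. The helper knn_predict is hypothetical, written only for illustration, and should closely match the kknn result up to tie-breaking.
# a from-scratch sketch of the kNN averaging formula (illustration only)
knn_predict <- function(x0, x, y, k = 1) {
  d <- abs(x - x0)         # distances to the target point
  nb <- order(d)[1:k]      # indices of the k nearest neighbors
  mean(y[nb])              # average their outcomes
}
manual.pred <- sapply(test.x, knn_predict, x = x, y = y, k = 1)
max(abs(manual.pred - test.pred))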
Tuning \(k\) is crucial for \(k\)NN. Let’s observe its effect. This time, I generate 200 observations. For 1NN, the fitted regression line is very jumpy.
# generate more data
set.seed(1)
n = 200
x <- runif(n, 0, 2*pi)
y <- 2*sin(x) + rnorm(length(x))
test.y = 2*sin(test.x) + rnorm(length(test.x))
# 1-nearest neighbor
knn.fit = kknn(y ~ ., train = data.frame("x" = x, "y" = y),
test = data.frame("x" = test.x),
k = 1, kernel = "rectangular")
test.pred = knn.fit$fitted.values
We can evaluate the prediction error of this model:
# prediction error
mean((test.pred - test.y)^2)
## [1] 2.097488
If we consider different values of \(k\), we can observe the trade-off between bias and variance.
## Prediction Error for K = 1 : 2.09748780881932
## Prediction Error for K = 5 : 1.39071867992277
## Prediction Error for K = 10 : 1.24696415340282
## Prediction Error for K = 33 : 1.21589627474692
## Prediction Error for K = 66 : 1.37604707375972
## Prediction Error for K = 100 : 1.42868908518756
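A loop along the following lines should reproduce this comparison (a sketch, since the code that produced the printout is not shown; it reuses the same kknn call as before).
# sketch: evaluate the testing error for several values of k
for (k in c(1, 5, 10, 33, 66, 100)) {
  fit <- kknn(y ~ ., train = data.frame("x" = x, "y" = y),
              test = data.frame("x" = test.x),
              k = k, kernel = "rectangular")
  cat("Prediction Error for K =", k, ":",
      mean((fit$fitted.values - test.y)^2), "\n")
}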
As \(k\) increases, we have a more stable model, i.e., smaller variance. However, the bias is also increased. As \(k\) decreases, the bias decreases, but the model is less stable.
Formally, the prediction error (at a given target point \(x_0\)) can be broken down into three parts: the irreducible error, the squared bias, and the variance.
\[\begin{aligned} \text{E}\Big[ \big( Y - \widehat f(x_0) \big)^2 \Big] &= \text{E}\Big[ \big( Y - f(x_0) + f(x_0) - \text{E}[\widehat f(x_0)] + \text{E}[\widehat f(x_0)] - \widehat f(x_0) \big)^2 \Big] \\ &= \text{E}\Big[ \big( Y - f(x_0) \big)^2 \Big] + \text{E}\Big[ \big(f(x_0) - \text{E}[\widehat f(x_0)] \big)^2 \Big] + \text{E}\Big[ \big(\text{E}[\widehat f(x_0)] - \widehat f(x_0) \big)^2 \Big] + \text{Cross Terms}\\ &= \underbrace{\text{E}\Big[ \big( Y - f(x_0) \big)^2 \Big]}_{\text{Irreducible Error}} + \underbrace{\Big(f(x_0) - \text{E}[\widehat f(x_0)]\Big)^2}_{\text{Bias}^2} + \underbrace{\text{E}\Big[ \big(\widehat f(x_0) - \text{E}[\widehat f(x_0)] \big)^2 \Big]}_{\text{Variance}} \end{aligned}\]
The cross terms vanish because the noise in the new observation \(Y\) is independent of the training data and \(f(x_0) - \text{E}[\widehat f(x_0)]\) is a constant. As we can see from the previous example, when \(k=1\) the prediction error is about 2: the irreducible error is 1 (the variance of the error term), the bias is almost 0 since the true function is smooth, and the variance is essentially the variance of a single neighbor's noise, which is again 1. At the other extreme, when \(k = n\), the variance is on the order of \(1/n\), while the bias is the difference between the \(\sin\) function and its overall average. To summarize the trade-off: a small \(k\) gives low bias but high variance, while a large \(k\) gives low variance but high bias.
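These numbers can be checked with a small simulation sketch, assuming the same data-generating model \(2\sin(x) + N(0, 1)\) as above: repeatedly generate training sets, fit 1NN at a fixed target point, and look at the bias and variance of the fitted values.
# simulation sketch: squared bias and variance of 1NN at a fixed target point
x0 <- pi / 2
f0 <- 2 * sin(x0)                      # true function value at x0
nsim <- 1000
fhat <- rep(NA, nsim)
set.seed(2)
for (i in 1:nsim) {
  xs <- runif(200, 0, 2 * pi)
  ys <- 2 * sin(xs) + rnorm(200)
  fit <- kknn(y ~ ., train = data.frame("x" = xs, "y" = ys),
              test = data.frame("x" = x0),
              k = 1, kernel = "rectangular")
  fhat[i] <- fit$fitted.values
}
(mean(fhat) - f0)^2                    # squared bias, nearly 0
var(fhat)                              # variance, close to 1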
For classification, kNN finds neighbors in exactly the same way as in regression. The only difference is that it uses majority voting instead of averaging: we predict the most popular class label among the neighbors. For 1NN, the prediction is simply the class of the closest neighbor. The decision regions of 1NN form a Voronoi tessellation. The plot on the left shows some randomly observed data in \([0, 1]^2\), and the plot on the right shows the corresponding 1NN classification model.
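To make the voting rule concrete, here is a minimal sketch; knn_classify is a hypothetical helper written for illustration, not a function from any package.
# a minimal sketch of majority voting among the k nearest neighbors
knn_classify <- function(x0, X, labels, k = 1) {
  d <- sqrt(rowSums(sweep(X, 2, x0)^2))   # Euclidean distances to x0
  nb <- order(d)[1:k]                     # the k nearest neighbors
  names(which.max(table(labels[nb])))     # most frequent class label
}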
We use artificial data from the ElemStatLearn package.
library(ElemStatLearn)
x <- mixture.example$x
y <- mixture.example$y
xnew <- mixture.example$xnew
par(mar=rep(2,4))
plot(x, col=ifelse(y==1, "darkorange", "deepskyblue"),
axes = FALSE, pch = 19)
box()
The decision boundary seems to be highly nonlinear. We can utilize the contour() function to demonstrate the result of \(k = 15\).
# knn classification with k = 15; the knn() function is from the class package
library(class)
k = 15
knn.fit <- knn(x, xnew, y, k = k)
px1 <- mixture.example$px1
px2 <- mixture.example$px2
pred <- matrix(knn.fit == "1", length(px1), length(px2))
contour(px1, px2, pred, levels=0.5, labels="",axes=FALSE)
box()
title(paste(k, "-Nearest Neighbor", sep= ""))
points(x, col=ifelse(y==1, "darkorange", "deepskyblue"), pch = 19)
mesh <- expand.grid(px1, px2)
points(mesh, pch=".", cex=1.2, col=ifelse(pred, "darkorange", "deepskyblue"))
We can evaluate the in-sample prediction of this model (since no labeled testing sample is available) using a confusion matrix:
# the confusion matrix
knn.fit <- knn(x, x, y, k = 15)
xtab = table(knn.fit, y)
library(caret)
confusionMatrix(xtab)
## Confusion Matrix and Statistics
##
## y
## knn.fit 0 1
## 0 82 13
## 1 18 87
##
## Accuracy : 0.845
## 95% CI : (0.7873, 0.8922)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.69
##
## Mcnemar's Test P-Value : 0.4725
##
## Sensitivity : 0.8200
## Specificity : 0.8700
## Pos Pred Value : 0.8632
## Neg Pred Value : 0.8286
## Prevalence : 0.5000
## Detection Rate : 0.4100
## Detection Prevalence : 0.4750
## Balanced Accuracy : 0.8450
##
## 'Positive' Class : 0
##
caret Package
The caret package has built-in features that can tune popular machine learning models using cross-validation. The cross-validation settings need to be specified using the trainControl() function.
library(caret)
control <- trainControl(method = "cv", number = 10)
There are other cross-validation methods, such as repeatedcv, which repeats the CV several times, and leave-one-out CV (LOOCV). For more details, you can read the documentation.
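For example, these two alternatives can be set up as follows (a sketch; the remaining settings are left at their caret defaults):
# repeated 10-fold CV (repeated 3 times) and leave-one-out CV
control.repeat <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
control.loocv  <- trainControl(method = "LOOCV")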
We can then set up the training by specifying a grid of \(k\) values and the cross-validation settings. Make sure that you specify method = "knn" and construct the outcome as a factor in a data frame.
set.seed(1)
knn.cvfit <- train(y ~ ., method = "knn",
data = data.frame("x" = x, "y" = as.factor(y)),
tuneGrid = data.frame(k = seq(1, 40, 1)),
trControl = control)
plot(knn.cvfit$results$k, 1-knn.cvfit$results$Accuracy,
xlab = "K", ylab = "Classification Error", type = "b",
pch = 19, col = "darkorange")
Printing out the fitted object, we can see that the best \(k\) is 6. There is also a clear "U"-shaped pattern that reflects the bias-variance trade-off.
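The selected value can also be extracted directly from the fitted object (a small sketch; bestTune is the slot caret uses to store the chosen tuning parameters):
# the tuning value selected by cross-validation
knn.cvfit$bestTune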
Closeness between two points needs to be defined based on some distance measures. By default, we use the squared Euclidean distance (\(\ell_2\) norm) for continuous variables:
\[d^2(\mathbf{u}, \mathbf{v}) = \lVert \mathbf{u}- \mathbf{v}\rVert_2^2 = \sum_{j=1}^p (u_j - v_j)^2.\] However, this measure is not scale invariant. A variable with large scale can dominate this measure. Hence, we often consider a normalized version:
\[d^2(\mathbf{u}, \mathbf{v}) = \sum_{j=1}^p \frac{(u_j - v_j)^2}{\sigma_j^2},\] where \(\sigma_j^2\) can be estimated using the sample variance of variable \(j\). Another choice, which further takes the covariance structure into account, is the Mahalanobis distance:
\[d^2(\mathbf{u}, \mathbf{v}) = (\mathbf{u} - \mathbf{v})^\text{T}\Sigma^{-1} (\mathbf{u} - \mathbf{v}),\] where \(\Sigma\) is the covariance matrix and can be estimated using the sample covariance. In the following plot, the red cross and the orange cross have the same Euclidean distance to the center. However, the red cross is more of an "outlier" based on the joint distribution, and the Mahalanobis distance reflects this.
x=rnorm(100)
y=1 + 0.3*x + 0.3*rnorm(100)
library(car)
dataEllipse(x, y, levels=c(0.6, 0.9), col = c("black", "deepskyblue"), pch = 19)
points(1, 0.5, col = "red", pch = 4, cex = 2, lwd = 4)
points(1, 1.5, col = "darkorange", pch = 4, cex = 3, lwd = 4)
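To see this numerically, here is a sketch using the sample mean and covariance of the simulated data together with the mahalanobis() function from the stats package (which returns squared distances):
# compare squared Euclidean and Mahalanobis distances for the two crosses
pts <- rbind(red = c(1, 0.5), orange = c(1, 1.5))
ctr <- colMeans(cbind(x, y))
S   <- cov(cbind(x, y))
rowSums(sweep(pts, 2, ctr)^2)            # roughly equal Euclidean distances
mahalanobis(pts, center = ctr, cov = S)  # the red cross is much farther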
For categorical variables, the Hamming distance is commonly used. It simply counts how many entries are not the same.
\[d(\mathbf{u}, \mathbf{v}) = \sum_{j=1}^p I(u_j \neq v_j).\]
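In R this is simply a count of mismatching entries; for example, a tiny sketch with two made-up vectors:
# Hamming distance between two categorical vectors
u <- c("a", "b", "b", "c")
v <- c("a", "c", "b", "a")
sum(u != v)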
\(K\)NN can be quite computationally intensive for large sample sizes, because finding the nearest neighbors requires calculating and comparing the distances to each of the data points. In addition, it is not memory friendly, because we need to store the entire training dataset for future predictions. In contrast, for a linear model, we only need to store the estimated \(\boldsymbol \beta\) parameters. Some algorithms have been developed to search for the neighbors more efficiently. You can try the FNN package for faster computations when \(n\) is large.
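Here is a quick sketch of kNN regression with FNN on some simulated data, assuming the package is installed; the knn.reg() function and its pred component are taken from that package.
# kNN regression with FNN, which uses fast neighbor-search algorithms
library(FNN)
set.seed(1)
xtrain <- matrix(runif(5000, 0, 2 * pi), ncol = 1)
ytrain <- 2 * sin(xtrain[, 1]) + rnorm(5000)
xtest  <- matrix(seq(0, 2 * pi, length.out = 100), ncol = 1)
fnn.fit <- knn.reg(train = xtrain, test = xtest, y = ytrain, k = 10)
head(fnn.fit$pred)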
You should be careful about which function to use when performing \(k\)NN. Some packages are for regression and some are for classification. Not many functions can be used for both purposes. Try using ??knn to see a summary of different functions.