Let’s consider another example using the handwritten digit data. Each observation in this data is a \(16 \times 16\) pixel image, so the total number of variables is \(256\). Each pixel records a grayscale intensity as its numerical value.
# Handwritten Digit Recognition Data
library(ElemStatLearn)
# the first column is the true digit
dim(zip.train)
## [1] 7291 257
dim(zip.test)
## [1] 2007 257
# look at one sample
image(zip2image(zip.train, 1), col=gray(256:0/256), zlim=c(0,1),
      xlab="", ylab="", axes = FALSE)
## [1] "digit 6 taken"
We use 3NN to predict all samples in the testing data. The model is fairly accurate.
library(class)
# fit 3nn model and calculate the error
knn.fit <- knn(zip.train[, 2:257], zip.test[, 2:257], zip.train[, 1], k=3)
# overall prediction error
mean(knn.fit != zip.test[,1])
## [1] 0.05530643
# the confusion matrix
table(knn.fit, zip.test[,1])
##
## knn.fit 0 1 2 3 4 5 6 7 8 9
## 0 355 0 7 2 0 2 3 0 4 1
## 1 0 257 0 0 2 0 0 1 0 0
## 2 3 0 183 2 0 2 1 1 1 0
## 3 0 0 1 153 0 5 0 1 5 0
## 4 0 3 1 0 183 0 2 4 0 3
## 5 0 0 0 7 2 145 1 0 2 1
## 6 0 2 1 0 2 1 163 0 0 0
## 7 0 2 2 1 2 0 0 138 1 4
## 8 0 0 3 0 1 1 0 1 151 0
## 9 1 0 0 1 8 4 0 1 2 168
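The diagonal of the confusion matrix counts the correctly classified samples for each digit, while the columns correspond to the true digits. As a quick sketch using the objects created above, we can also compute per-digit error rates:
# per-digit error rate based on the confusion matrix
conf.mat = table(knn.fit, zip.test[, 1])
round(1 - diag(conf.mat) / colSums(conf.mat), 3)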
Many of the practical problems we encounter today are high-dimensional. The handwritten digit example already has \(16 \times 16 = 256\) variables, and genetic studies often involve more than 25K gene expressions, etc. For a given sample size \(n\), as the number of variables \(p\) increases, the data becomes very sparse. Nearest neighbor methods usually do not perform well on high-dimensional data, because for any given target point, there will not be enough training data lying close to it.
To see this, let’s consider a \(p\)-dimensional hyper-cube. Suppose we have \(n = 1000\) observations spread uniformly over this cube, and we are interested in predicting a target point using its \(k = 10\) nearest neighbors. If these neighbors are really close to the target point, the estimate would have a small bias. Suppose \(p = 2\). If we take a sub-cube (a square) with both height and width \(l = 0.1\), then on average there will be \(1000 \times 0.1^2 = 10\) observations within that square. In general, we have the relationship
\[l^p = \frac{k}{n}, \quad \text{or equivalently} \quad l = \Big(\frac{k}{n}\Big)^{1/p}.\] Trying different values of \(p\), we have the following.
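As a quick sketch, here is the required edge length for a few dimensions, with \(n = 1000\) and \(k = 10\):
# required edge length l = (k/n)^(1/p) at different dimensions
p = c(2, 5, 10, 100)
round((10/1000)^(1/p), 3)
# approximately 0.100, 0.398, 0.631 and 0.955, respectively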
This implies that with 100 dimensions, the nearest 10 observations would be about \(0.955\) away from the target point in each dimension, which is almost at the other corner of the cube. Hence the estimate will have a very large bias. Decreasing \(k\) does not help much in this situation, since even the closest point can be very far away, and a smaller \(k\) would also inflate the variance of the model.
Let’s try a small experiment on this high-dimensional issue. Suppose we have a 10-dimensional setting with covariates uniformly distributed on \((0, 1)^{10}\), and suppose the outcome follows a regression model that depends only on the first three variables. The following code sets up this model, and we want to know its expected performance. Recall the simulation idea we used for Ridge regression: we can repeat this experiment many times to estimate the approximation error
\[ E \Big[ \big(\widehat{f}(x_0) - f(x_0) \big)^2 \Big]\] where \(x_0 = (0, 0, 0, \ldots, 0)\).
# Let's try a new package FNN
library(FNN)
p = 10
n = 100
# the target prediction point
# be careful that this needs to be a 1xp matrix
x0 = matrix(rep(0, p), nrow = 1, ncol = p)
# number of simulations
nsim = 300
# vector to store predicted values
allerror = rep(NA, nsim)
for (l in 1:nsim)
{
  # generate data
  X = matrix(runif(n*p), nrow = n, ncol = p)
  y = X[, 1] + X[, 2] + X[, 3] + rnorm(n, sd = 0.5)
  # "5-nearest neighbor" regression using the FNN package
  # for this question, use the "brute force" algorithm to search for the NNs
  knn.fit = knn.reg(train = X, test = x0, y = y,
                    k = 5, algorithm = "brute")
  # record the prediction error of this run
  # the truth f(x_0) is 0
  allerror[l] = (knn.fit$pred - 0)^2
}
# the prediction error
mean(allerror)
## [1] 1.052382
Using the same code, calculate the prediction error at \(x_0 = (1, 1, 1, 0, \ldots, 0)\). What are the prediction errors when \(p = 5, 10\) and \(20\)?
# Let's try a new package FNN
library(FNN)
p = 20
n = 100
# the target prediction point
# be careful that this needs to be a 1xp matrix
x0 = matrix(c(rep(1, 3), rep(0, p-3)), nrow = 1, ncol = p)
# number of simulations
nsim = 300
# vector to store predicted values
allerror = rep(NA, nsim)
for (l in 1:nsim)
{
  # generate data
  X = matrix(runif(n*p), nrow = n, ncol = p)
  y = X[, 1] + X[, 2] + X[, 3] + rnorm(n, sd = 0.5)
  # "5-nearest neighbor" regression using the FNN package
  # for this question, use the "brute force" algorithm to search for the NNs
  knn.fit = knn.reg(train = X, test = x0, y = y,
                    k = 5, algorithm = "brute")
  # record the prediction error of this run
  # the truth f(x0) is 3
  allerror[l] = (knn.fit$pred - 3)^2
}
# the prediction error
mean(allerror)
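To answer the second part of the question, one convenient approach is to wrap the simulation in a function and loop over the dimensions. The following is a minimal sketch; the helper function knn_sim_error is simply defined here for convenience, not taken from any package.
library(FNN)
# average squared prediction error at x0 = (1, 1, 1, 0, ..., 0) for dimension p
knn_sim_error <- function(p, n = 100, k = 5, nsim = 300)
{
  x0 = matrix(c(rep(1, 3), rep(0, p - 3)), nrow = 1, ncol = p)
  allerror = rep(NA, nsim)
  for (l in 1:nsim)
  {
    X = matrix(runif(n*p), nrow = n, ncol = p)
    y = X[, 1] + X[, 2] + X[, 3] + rnorm(n, sd = 0.5)
    knn.fit = knn.reg(train = X, test = x0, y = y, k = k, algorithm = "brute")
    allerror[l] = (knn.fit$pred - 3)^2
  }
  mean(allerror)
}
# prediction errors for p = 5, 10 and 20
sapply(c(5, 10, 20), knn_sim_error)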
However, why does our model perform well in the handwritten digit example? There is possibly an (approximately) lower-dimensional representation of the data, so that evaluating distances in the high-dimensional space is just as effective as working in the low-dimensional space. Moreover, the digit classes can be effectively separated if we only focus on this low-dimensional space. Hence, the performance can be satisfactory.
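One rough way to check this intuition is to see how much of the pixel variance is concentrated in a small number of principal components; the following is a quick sketch using base R:
# principal components of the 256 pixel variables
pc = prcomp(zip.train[, 2:257])
# proportion of total variance explained by the first 30 components
sum(pc$sdev[1:30]^2) / sum(pc$sdev^2)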
Dimension reduction is an important topic in statistical learning and machine learning. Many methods, such as sliced inverse regression (Li, 1991) and UMAP (McInnes et al., 2018), have been developed based on this concept.
Think about the previous example of the dimensionality issue. What if all covariates are highly linearly dependent? Would that lead to an approximately low dimensional representation and potentially improve the prediction accuracy? Can you use a simulation study to confirm that?
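One possible way to set up such a simulation is sketched below: the covariates are generated from a single shared latent variable plus a small amount of independent noise, so that they are highly correlated, and the resulting error can be compared with the independent-covariate setting above. The noise level (0.1) is an arbitrary choice for illustration.
library(FNN)
p = 10
n = 100
nsim = 300
# the target prediction point
x0 = matrix(rep(0, p), nrow = 1, ncol = p)
allerror = rep(NA, nsim)
for (l in 1:nsim)
{
  # highly correlated covariates: one latent variable plus small independent noise
  z = runif(n)
  X = matrix(z, nrow = n, ncol = p) + matrix(runif(n*p, 0, 0.1), nrow = n, ncol = p)
  y = X[, 1] + X[, 2] + X[, 3] + rnorm(n, sd = 0.5)
  knn.fit = knn.reg(train = X, test = x0, y = y, k = 5, algorithm = "brute")
  # the truth f(x_0) is still 0
  allerror[l] = (knn.fit$pred - 0)^2
}
# compare this with the error from the independent-covariate setting above
mean(allerror)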