KNN vs. Kernel

We first compare the \(K\)NN method with a Gaussian kernel regression. \(K\)NN produces a step function with jumps, while Gaussian kernel regression is smooth.

Gaussian Kernel Regression

The Nadaraya-Watson kernel regression estimator is defined as

\[\widehat f(x) = \frac{\sum_i K_h(x, x_i) y_i}{\sum_i K_h(x, x_i)},\] where \(h\) is a bandwidth parameter that controls how quickly the weights decay with distance. At each target point \(x\), training points closer to \(x\) receive higher weights \(K_h(x, x_i)\), hence their \(y_i\) values are more influential in estimating \(f(x)\). For the Gaussian kernel, we use

\[K_h(x, x_i) = \frac{1}{h\sqrt{2\pi}} \exp\left\{ -\frac{(x - x_i)^2}{2 h^2}\right\}\]

The bandwidth \(h\) is an important tuning parameter that controls the bias-variance trade-off: a larger \(h\) makes the estimator more stable (lower variance) but introduces more bias.

    # x, y: training data; testx: target points for evaluation
    # note: ksmooth rescales the kernel so its quartiles are at +/- 0.25 * bandwidth
    h = 2.5
    ksmooth.fit = ksmooth(x, y, bandwidth = h, kernel = "normal", x.points = testx)
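
To make the formula concrete, here is a minimal sketch of the Nadaraya-Watson estimator written directly from the definition (the helper nw is my own name, and it assumes the same x, y, and testx as above):

    # Nadaraya-Watson estimate at a single target point x0
    nw <- function(x0, x, y, h) {
      w <- dnorm(x0, mean = x, sd = h)  # Gaussian kernel weights K_h(x0, x_i)
      sum(w * y) / sum(w)               # weighted average of the y_i
    }
    nw.fit <- sapply(testx, nw, x = x, y = y, h = 2.5)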

Other kernel functions can also be used. The most efficient kernel is the Epanechnikov kernel, which minimizes the asymptotic mean integrated squared error (MISE). The efficiency of a kernel is defined as

\[ \Big(\int u^2 K(u) \, du\Big)^\frac{1}{2} \int K^2(u) \, du, \] where \(u\) is the (scaled) difference between the target point and an observation. Different kernel functions are visualized below. Most compactly supported kernels are nonzero only within \([-h/2, h/2]\); the Gaussian kernel is an exception with unbounded support.
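
As one way to produce such a comparison plot, the sketch below draws the uniform, Epanechnikov, and Gaussian kernels with \(h = 1\) (the grid and the rescaling of the supports to \([-1/2, 1/2]\) are my own choices):

    # compare three kernel shapes on a common grid (h = 1)
    u <- seq(-1.5, 1.5, 0.01)
    gauss <- dnorm(u)                                        # Gaussian (unbounded support)
    unif  <- ifelse(abs(u) <= 0.5, 1, 0)                     # uniform on [-1/2, 1/2]
    epan  <- ifelse(abs(u) <= 0.5, 1.5 * (1 - (2*u)^2), 0)   # Epanechnikov on [-1/2, 1/2]
    plot(u, gauss, type = "l", col = "darkorange", lwd = 2, ylim = c(0, 1.6),
         xlab = "u", ylab = "K(u)")
    lines(u, unif, col = 2, lwd = 2)
    lines(u, epan, col = 4, lwd = 2)
    legend("topright", c("Gaussian", "Uniform", "Epanechnikov"),
           col = c("darkorange", 2, 4), lty = 1, lwd = 2)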

Local Linear Regression

Local averaging suffers severe bias at the boundaries. One solution is to use local polynomial regression. The following examples are local linear regressions, evaluated at different target points. At each target point \(x\), we minimize the kernel-weighted least squares criterion

\[\sum_{i = 1}^n K_h(x, x_i) \big( y_i - \beta_0 - \beta_1 x_i \big)^2\] over \((\beta_0, \beta_1)\), and the fitted value at \(x\) is \(\widehat f(x) = \widehat\beta_0 + \widehat\beta_1 x\).
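
To see this weighted least squares view in code, here is a sketch (loclin is a hypothetical helper; x, y, testx, and the bandwidth 0.5 are assumptions) that fits a weighted lm at each target point:

    # local linear fit at a single target point x0, using Gaussian kernel weights
    loclin <- function(x0, x, y, h) {
      w   <- dnorm(x0, mean = x, sd = h)           # K_h(x0, x_i)
      fit <- lm(y ~ x, weights = w)                # weighted least squares
      predict(fit, newdata = data.frame(x = x0))   # fitted value at x0
    }
    loclin.fit <- sapply(testx, loclin, x = x, y = y, h = 0.5)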

Local Polynomial Regression

The following examples are local polynomial regressions, evaluated at different target points. We can easily extend the local linear model to incorporate higher-order terms:

\[\sum_{i=1}^n K_h(x, x_i) \Big[ y_i - \beta_0(x) - \sum_{r=1}^d \beta_r(x) x_i^r \Big]^2\]

The following are local quadratic fits, which further correct the bias.
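
Extending the earlier local linear sketch to higher orders only changes the design matrix; one possible implementation (locpolyfit is my own helper name, and the bandwidth is again an assumption) uses poly() inside a weighted lm:

    # local polynomial (degree d) fit at target point x0
    locpolyfit <- function(x0, x, y, h, d = 2) {
      w   <- dnorm(x0, mean = x, sd = h)                # kernel weights
      fit <- lm(y ~ poly(x, degree = d), weights = w)   # weighted polynomial fit
      predict(fit, newdata = data.frame(x = x0))
    }
    quad.fit <- sapply(testx, locpolyfit, x = x, y = y, h = 0.5, d = 2)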

R Implementations

Several popular R functions implement local polynomial regression: loess (in base R), locfit (from the locfit package), and locpoly (from the KernSmooth package). These functions automatically calculate the fitted value at each target point (essentially all the observed points). This can be used in combination with ggplot2; the point-wise confidence intervals are also calculated.

    library(ggplot2)
    ggplot(mpg, aes(displ, hwy)) + geom_point() +
      geom_smooth(col = "red", method = "loess", span = 0.5)
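
For the point-wise intervals, loess can also be called directly and asked for standard errors via predict(..., se = TRUE); the grid and the 1.96 multiplier below are my own choices, assuming an approximate normal band:

    # fit loess directly and extract point-wise standard errors
    fit  <- loess(hwy ~ displ, data = mpg, span = 0.5)
    newx <- data.frame(displ = seq(1.6, 7, 0.1))
    pred <- predict(fit, newdata = newx, se = TRUE)
    # approximate 95% point-wise confidence band
    upper <- pred$fit + 1.96 * pred$se.fit
    lower <- pred$fit - 1.96 * pred$se.fit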

A toy example comparing locpoly and locfit, each with its own bandwidth setting:

    # local polynomial fitting using locfit and locpoly
    
    library(KernSmooth)
    library(locfit)
    
    set.seed(1)  # for reproducibility
    n <- 100
    x <- runif(n, 0, 1)
    y <- sin(2 * pi * x) + rnorm(n, 0, 1)
    
    # sort the data by x for plotting
    y <- y[order(x)]
    x <- sort(x)
    
    plot(x, y, pch = 19)
    points(x, sin(2 * pi * x), lwd = 3, type = "l", col = 1)  # true function
    lines(locpoly(x, y, bandwidth = 0.15, degree = 2), col = 2, lwd = 3)
    lines(locfit(y ~ lp(x, nn = 0.2, h = 0.05, deg = 2)), col = 4, lwd = 3)
    legend("topright", c("locpoly", "locfit"), col = c(2, 4), lty = 1, cex = 1.5, lwd = 2)

Kernel Density Estimation

A natural estimator, based on counting the observations near \(x\), is

\[\widehat f(x) = \frac{\#\big\{x_i: x_i \in [x - \frac{h}{2}, x + \frac{h}{2}]\big\}}{hn}\] This may be compared with the histogram estimator:

    # counting estimator: proportion of observations within h/2 of x, divided by h
    countdens <- function(x, obs, h) sum((x - h/2 <= obs) & (x + h/2 > obs)) / (h * length(obs))
    
    par(mfrow = c(1, 2), mar = rep(2, 4))
    hist(mpg$hwy, breaks = seq(6, 50, 2))
    xgrid = seq(6, 50, 0.1)
    plot(xgrid, sapply(xgrid, FUN = countdens, obs = mpg$hwy, h = 2), type = "s")

Here is a close-up demonstration of how those uniform density functions are stacked for all observations.
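
If you want to reproduce such a picture, here is a toy sketch (the five observations and the bandwidth are made up) that draws each observation's box of height \(1/(nh)\) along with their stacked sum:

    # stack the per-observation uniform densities for a few toy points
    obs <- c(2, 2.8, 3.5, 4, 6)
    h   <- 1.5
    xgrid <- seq(0, 8, 0.01)
    # each observation contributes a box of height 1/(n*h) on [x_i - h/2, x_i + h/2]
    boxes <- sapply(obs, function(xi) (xgrid >= xi - h/2 & xgrid < xi + h/2) / (h * length(obs)))
    plot(xgrid, rowSums(boxes), type = "s", lwd = 2, ylab = "Estimated Density")
    for (i in seq_along(obs)) lines(xgrid, boxes[, i], type = "s", col = "gray")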

However, this estimator is apparently very bumpy. Let’s consider using a smooth function instead of the counts (a uniform kernel). Naturally, we can use the Gaussian kernel function in place of the counts in the numerator, which gives the kernel density estimator \[\widehat f(x) = \frac{1}{n} \sum_{i=1}^n K_h(x, x_i),\] with \(K_h\) the Gaussian kernel defined earlier.

We further demonstrate this on the mpg dataset.

    xgrid = seq(6, 50, 0.1)
    # sum of Gaussian kernels centered at the observations, evaluated at x
    kernelfun <- function(x, obs, h) sum(exp(-0.5 * ((x - obs) / h)^2) / sqrt(2 * pi) / h)
    plot(xgrid, sapply(xgrid, FUN = kernelfun, obs = mpg$hwy, h = 1.5) / length(mpg$hwy), type = "l",
         xlab = "MPG", ylab = "Estimated Density", col = "darkorange", lwd = 3)