Chapter 11 Kernel Smoothing

The fundamental idea of local regression approaches is similar to \(k\)NN. However, most of these approaches address a key drawback of \(k\)NN: the estimated function is not smooth. A smooth estimate also allows us to estimate derivatives, which is useful, for example, in density estimation. We will start with the intuition behind the kernel estimator and then discuss the bias-variance trade-off using kernel density estimation as an example.

11.1 KNN vs. Kernel

We first compare the \(k\)NN method with a Gaussian kernel regression: \(k\)NN produces a piecewise-constant fit with jumps, while the Gaussian kernel regression is smooth, as the sketch below illustrates.
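
As a quick illustration, here is a minimal sketch on simulated data; the sine curve, sample size, \(k\), and bandwidth are illustrative choices, not fixed by the text.

  # simulated example: a sine curve with noise (illustrative choice)
  set.seed(1)
  n = 40
  x = runif(n, 0, 2*pi)
  y = 2*sin(x) + rnorm(n)
  testx = seq(0, 2*pi, 0.01)
  
  # kNN regression: average the y values of the k nearest neighbors
  k = 10
  knnfit = sapply(testx, function(x0) mean(y[order(abs(x - x0))[1:k]]))
  
  # Gaussian kernel regression via ksmooth()
  kernfit = ksmooth(x, y, kernel = "normal", bandwidth = 1, x.points = testx)
  
  plot(x, y, pch = 19, xlim = c(0, 2*pi))
  lines(testx, knnfit, type = "s", col = "darkorange", lwd = 2)   # piecewise constant
  lines(kernfit$x, kernfit$y, col = "deepskyblue", lwd = 2)       # smooth
  legend("topright", c("kNN", "Gaussian kernel"),
         col = c("darkorange", "deepskyblue"), lty = 1, lwd = 2)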

11.2 Kernel Density Estimation

A natural estimator, based on counting the observations near \(x\), is

\[\widehat f(x) = \frac{\#\big\{x_i: x_i \in [x - \frac{h}{2}, x + \frac{h}{2}]\big\}}{h n}\] This may be compared with the histogram estimator:

    library(ggplot2)
    data(mpg)
    
    # histogram 
    hist(mpg$hwy, breaks = seq(6, 50, 2))
    
    # uniform kernel
    xgrid = seq(6, 50, 0.1)
    histden = sapply(xgrid, FUN = function(x, obs, h) sum( ((x-h/2) <= obs) * ((x+h/2) > obs))/h/length(obs), 
                     obs = mpg$hwy, h = 2)
    
    plot(xgrid, histden, type = "s")

This can be viewed in two ways. The easier interpretation is that, for each target point, we count how many observations are nearby. Alternatively, we can interpret it as evenly distributing the point mass of each observation over a region of width \(h\) centered at that observation, and then stacking them all.

\[\widehat f(x) = \frac{1}{n} \sum_i \,\underbrace{ \frac{1}{h}\, \mathbf{1} \Big(x \in [x_i - \frac{h}{2}, x_i + \frac{h}{2}] \Big)}_{\text{uniform density centered at } x_i}\] Here is a close-up demonstration of how those uniform density functions are stacked for all observations.
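
To make this second interpretation concrete, the following sketch draws each observation's small (scaled) uniform density and their sum; the close-up range and bandwidth \(h = 2\) are illustrative choices.

  # each observation contributes a uniform density of width h, scaled by 1/n;
  # the estimate is their sum (stack)
  h = 2
  obs = mpg$hwy
  n = length(obs)
  xgrid = seq(12, 24, 0.05)      # a close-up range (illustrative)
  
  # stacked (summed) estimate
  den = sapply(xgrid, function(x) sum(dunif(x, obs - h/2, obs + h/2))/n)
  plot(xgrid, den, type = "s", col = "darkorange", lwd = 2,
       xlab = "hwy", ylab = "density")
  
  # individual scaled uniform densities centered at the observations in view
  for (xi in obs[obs >= 12 & obs <= 24])
    lines(xgrid, dunif(xgrid, xi - h/2, xi + h/2)/n, col = "gray")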

However, this will lead to jumps at the ends of the small uniform densities. Let’s consider using a smooth function instead. Naturally, we can replace the uniform density in the above equation with a Gaussian density (kernel) function.

We apply this to the mpg dataset.

  # Gaussian kernel estimate: average of Gaussian densities (sd = h) centered at the observations
  xgrid = seq(6, 50, 0.1)
  kernelfun <- function(x, obs, h) sum(exp(-0.5*((x-obs)/h)^2)/sqrt(2*pi)/h)
  plot(xgrid, sapply(xgrid, FUN = kernelfun, obs = mpg$hwy, h = 1.5)/length(mpg$hwy), type = "l",
       xlab = "MPG", ylab = "Estimated Density", col = "darkorange", lwd = 3)

The ggplot2 package provides some convenient features for plotting the density and histogram together.

  ggplot(mpg, aes(x=hwy)) + 
    geom_histogram(aes(y=after_stat(density)), colour="black", fill="white")+
    geom_density(alpha=.2, fill="#ff8c00")

11.3 Bias-variance trade-off

Let’s consider estimating a density, using the Parzen estimator

\[\widehat f(x) = \frac{1}{n} \sum_{i=1}^n K_{h} (x, x_i)\] where \(K_h(\mathbf{u}, \mathbf{v}) = K(|\mathbf{u}- \mathbf{v}|/h)/h\) and \(K(\cdot)\) is a kernel function that satisfies

  • \(\int K(u)du = 1\) (a proper density)
  • \(K(-u) = K(u)\) (symmetric)
  • \(\int K(u) u^2 du < \infty\) (finite second moment)
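
As a quick numerical check, the standard Gaussian kernel \(K(u) = \frac{1}{\sqrt{2\pi}} e^{-u^2/2}\) satisfies all three conditions; here is a small sketch using integrate().

  # check the three kernel conditions for the Gaussian kernel K(u) = dnorm(u)
  integrate(dnorm, -Inf, Inf)$value                        # = 1, a proper density
  all.equal(dnorm(-1.3), dnorm(1.3))                       # symmetric
  integrate(function(u) u^2 * dnorm(u), -Inf, Inf)$value   # = 1, finite second moment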

Note that \(h\) simply rescales the covariate, with the density adjusted accordingly. Our goal is to estimate the density at a target point \(x\) using a set of i.i.d. observations. First, we analyze the bias:

\[\begin{align} \text{E}\big[ \widehat f(x) \big] &= \text{E}\left[ \frac{1}{h} K\left( \frac{x - x_1}{h} \right) \right] \\ &= \int_{-\infty}^\infty \frac{1}{h} K\left(\frac{x-x_1}{h}\right) f(x_1) \, d x_1 \\ (\text{substituting } t = (x - x_1)/h) \quad &= \int_{-\infty}^{\infty} K(t) \, f(x - th) \, dt \\ (\text{Taylor expansion}) \quad &= f(x) + \frac{h^2}{2} f''(x) \int_{-\infty}^\infty K(t) t^2 dt + o(h^2) \\ (\text{as } h \rightarrow 0) \quad &\rightarrow f(x) \end{align}\] The first-order term in the Taylor expansion vanishes because \(K\) is symmetric.

Since the density is estimated over the entire domain, we can define the integrated Bias\(^2\):

\[\begin{align} \text{Bias}^2 &= \int \left( E[\widehat f(x)] - f(x)\right)^2 dx \\ &\approx \frac{h^4 \sigma_K^4}{4} \int \big[ f''(x)\big]^2 dx \end{align}\] where \(\sigma_K^2 = \int_{-\infty}^\infty K(t) t^2 dt\).

On the other hand, the variance term is

\[\begin{align} \text{Var}\big[ \widehat f(x) \big] &= \frac{1}{n} \text{Var}\Big[\frac{1}{h}K\big( \frac{x - x_1}{h} \big) \Big] \\ &= \frac{1}{n} \bigg\{ \text{E}\Big[ \frac{1}{h^2} K^2\big( \frac{x - x_1}{h}\big) \Big] - \text{E}\Big[ \frac{1}{h} K\big( \frac{x - x_1}{h} \big)\Big]^2 \bigg\}\\ &= \frac{1}{n} \Big[ \int \frac{1}{h^2} K^2\big( \frac{x - x_1}{h} \big) f(x_1) \, dx_1 - O(1) \Big] \\ &= \frac{1}{n} \Big[ \frac{1}{h} \int K^2( u ) f(x - uh) \, du - O(1) \Big] \\ &\approx \frac{f(x)}{nh} \int K^2( u )\, du \end{align}\] Here the squared-mean term is \(O(1)\) since, from the bias calculation, \(\text{E}\big[\frac{1}{h}K\big(\frac{x - x_1}{h}\big)\big] \rightarrow f(x)\).

with the integrated variance being

\[\frac{1}{nh} \int K^2( u )\, du \]

By minimizing the asymptotic mean integrated squared error (AMISE), defined as the sum of integrated Bias\(^2\) and variance, we have the optimal \(h\) being

\[h^\text{opt} = \bigg[\frac{1}{n} \frac{\int K^2(u)du}{ \sigma^4_K \int \big[f''(u)\big]^2 du} \bigg]^{1/5},\]

so the optimal \(h\) is of order \(\mathcal{O}(n^{-1/5})\), which makes the resulting AMISE of order \(\mathcal{O}(n^{-4/5})\).
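
As an illustration of this formula, here is a sketch that assumes the data come from a standard normal, so \(\int [f''(x)]^2 dx = 3/(8\sqrt{\pi})\), and uses the Gaussian kernel, for which \(\int K^2(u)du = 1/(2\sqrt{\pi})\) and \(\sigma_K^2 = 1\).

  # plug-in optimal bandwidth for N(0,1) data with a Gaussian kernel (a sketch)
  set.seed(1)
  n = 200
  x = rnorm(n)
  
  RK = 1/(2*sqrt(pi))          # int K^2(u) du for the Gaussian kernel
  sigmaK2 = 1                  # int u^2 K(u) du
  Rf2 = 3/(8*sqrt(pi))         # int [f''(x)]^2 dx when f is the standard normal density
  
  hopt = (RK / (n * sigmaK2^2 * Rf2))^(1/5)
  hopt                         # roughly 1.06 * n^(-1/5)
  bw.nrd0(x)                   # Silverman's rule-of-thumb bandwidth, similar in magnitude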

11.4 Gaussian Kernel Regression

The Nadaraya-Watson kernel regression estimator takes the following form:

\[\widehat f(x) = \frac{\sum_i K_h(x, x_i) y_i}{\sum_i K_h(x, x_i)},\] where \(h\) is the bandwidth. At each target point \(x\), training points \(x_i\) that are closer to \(x\) receive higher weights \(K_h(x, x_i)\), hence their \(y_i\) values are more influential in estimating \(f(x)\). For the Gaussian kernel, we use

\[K_h(x, x_i) = \frac{1}{h\sqrt{2\pi}} \exp\left\{ -\frac{(x - x_i)^2}{2 h^2}\right\}\]
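
A minimal sketch of this estimator coded directly on simulated data (the sine curve and bandwidth are illustrative choices, not from the text):

  # Nadaraya-Watson estimator with a Gaussian kernel, coded directly
  set.seed(1)
  n = 100
  x = runif(n, 0, 2*pi)
  y = 2*sin(x) + rnorm(n)
  testx = seq(0, 2*pi, 0.01)
  h = 0.3
  
  nw = sapply(testx, function(x0) {
    w = dnorm((x0 - x)/h)/h          # kernel weights K_h(x0, x_i)
    sum(w * y) / sum(w)              # weighted average of the y_i
  })
  
  plot(x, y, pch = 19)
  lines(testx, 2*sin(testx), col = "deepskyblue", lwd = 3)
  lines(testx, nw, col = "darkorange", lwd = 3)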

11.4.1 Bias-variance Trade-off

The bandwidth \(h\) is an important tuning parameter that controls the bias-variance trade-off, and it behaves the same way as in density estimation: setting a larger \(h\) makes the estimator more stable but more biased.

  # x, y, and testx: simulated sine-curve data, e.g., as in the sketch above
  # a small bandwidth
  ksmooth.fit1 = ksmooth(x, y, bandwidth = 0.5, kernel = "normal", x.points = testx)

  # a large bandwidth
  ksmooth.fit2 = ksmooth(x, y, bandwidth = 2, kernel = "normal", x.points = testx)
  
  # plot both
  plot(x, y, xlim = c(0, 2*pi), pch = 19, xaxt='n', yaxt='n')
  lines(testx, 2*sin(testx), col = "deepskyblue", lwd = 3)
  lines(testx, ksmooth.fit1$y, type = "s", col = "darkorange", lwd = 3)
  lines(testx, ksmooth.fit2$y, type = "s", col = "red", lwd = 3)
  legend("topright", c("h = 0.5", "h = 2"), col = c("darkorange", "red"), 
         lty = 1, lwd = 2, cex = 1.5)

11.5 Choice of Kernel Functions

Other kernel functions can also be used. The most efficient kernel is the Epanechnikov kernel, which minimizes the asymptotic mean integrated squared error (AMISE). The efficiency of a kernel is defined as

\[ \Big(\int u^2K(u)\, du\Big)^\frac{1}{2} \int K^2(u)\, du, \] with a smaller value corresponding to a more efficient kernel. Different kernel functions can be visualized in the following. Most kernels are bounded within \([-h/2, h/2]\), except the Gaussian kernel, which has unbounded support.
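
Here is a sketch of a few common kernels; as an assumption about the plotting convention, the uniform and Epanechnikov kernels are rescaled to have support \([-h/2, h/2]\) with \(h = 1\), matching the convention above.

  # a few kernel functions with h = 1; bounded kernels rescaled to support [-1/2, 1/2]
  u = seq(-2, 2, 0.01)
  unif = ifelse(abs(u) <= 0.5, 1, 0)                # uniform kernel
  epan = ifelse(abs(u) <= 0.5, 1.5*(1 - 4*u^2), 0)  # Epanechnikov kernel
  gauss = dnorm(u)                                  # Gaussian kernel (unbounded support)
  
  plot(u, epan, type = "l", col = "darkorange", lwd = 3, ylab = "K(u)")
  lines(u, unif, col = "deepskyblue", lwd = 3)
  lines(u, gauss, col = "red", lwd = 3)
  legend("topright", c("Epanechnikov", "Uniform", "Gaussian"),
         col = c("darkorange", "deepskyblue", "red"), lty = 1, lwd = 2)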

11.6 Local Linear Regression

Local averaging suffers from severe bias at the boundaries. One solution is to use local polynomial regression. The following examples are local linear regressions, evaluated at different target points. At each target point \(x\), we fit a linear model weighted by the kernel weights, i.e., we minimize

\[\sum_{i = 1}^n K_h(x, x_i) \big( y_i - \beta_0 - \beta_1 x_i \big)^2\]
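
The fitted value at \(x\) is then \(\widehat\beta_0 + \widehat\beta_1 x\) from this locally weighted fit. A minimal sketch using lm() with kernel weights at a single target point follows; the simulated data, target point, and bandwidth are illustrative.

  # local linear regression at a single target point x0, via weighted least squares
  set.seed(1)
  x = runif(100, 0, 2*pi)
  y = 2*sin(x) + rnorm(100)
  
  x0 = 2                                    # target point
  h = 0.5                                   # bandwidth
  w = dnorm((x0 - x)/h)/h                   # Gaussian kernel weights K_h(x0, x_i)
  fit = lm(y ~ x, weights = w)              # minimizes the weighted sum of squares above
  predict(fit, data.frame(x = x0))          # local linear estimate of f(x0)
  2*sin(x0)                                 # true value, for comparison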

11.7 Local Polynomial Regression

The following examples are local polynomial regressions, evaluated at different target points. We can easily extend the local linear model to incorporate higher-order terms:

\[\sum_{i=1}^n K_h(x, x_i) \Big[ y_i - \beta_0(x) - \sum_{r=1}^d \beta_r(x) x_i^r \Big]^2\]

The following are local quadratic fits, which further correct the bias.

11.8 R Implementations

Several popular R functions implement local polynomial regression: loess, locfit, locpoly, etc. These functions automatically calculate the fitted value at each target point (essentially all of the observed points). This can be used in combination with ggplot2; point-wise confidence intervals are also calculated.

    ggplot(mpg, aes(displ, hwy)) + geom_point() +
      geom_smooth(col = "red", method = "loess", span = 0.5)
## `geom_smooth()` using formula = 'y ~ x'

A toy example that compares different bandwidths follows. Be careful that different methods may formulate the bandwidth parameter in different ways.

    # local polynomial fitting using locfit and locpoly
    
    library(KernSmooth)
    library(locfit)
    
    n <- 100
    x <- runif(n,0,1)
    y <- sin(2*pi*x)+rnorm(n,0,1)
    y = y[order(x)]
    x = sort(x)
    
    plot(x, y, pch = 19)
    points(x, sin(2*pi*x), lwd = 3, type = "l", col = 1)              # true function
    lines(locpoly(x, y, bandwidth=0.15, degree=2), col=2, lwd = 3)    # KernSmooth::locpoly
    lines(locfit(y~lp(x, nn = 0.3, h=0.05, deg=2)), col=4, lwd = 3)   # locfit, local quadratic
    
    legend("topright", c("locpoly", "locfit"), col = c(2,4), lty = 1, cex = 1.5, lwd =2)