\(\newcommand{\ci}{\perp\!\!\!\perp}\) \(\newcommand{\cA}{\mathcal{A}}\) \(\newcommand{\cB}{\mathcal{B}}\) \(\newcommand{\cC}{\mathcal{C}}\) \(\newcommand{\cD}{\mathcal{D}}\) \(\newcommand{\cE}{\mathcal{E}}\) \(\newcommand{\cF}{\mathcal{F}}\) \(\newcommand{\cG}{\mathcal{G}}\) \(\newcommand{\cH}{\mathcal{H}}\) \(\newcommand{\cI}{\mathcal{I}}\) \(\newcommand{\cJ}{\mathcal{J}}\) \(\newcommand{\cK}{\mathcal{K}}\) \(\newcommand{\cL}{\mathcal{L}}\) \(\newcommand{\cM}{\mathcal{M}}\) \(\newcommand{\cN}{\mathcal{N}}\) \(\newcommand{\cO}{\mathcal{O}}\) \(\newcommand{\cP}{\mathcal{P}}\) \(\newcommand{\cQ}{\mathcal{Q}}\) \(\newcommand{\cR}{\mathcal{R}}\) \(\newcommand{\cS}{\mathcal{S}}\) \(\newcommand{\cT}{\mathcal{T}}\) \(\newcommand{\cU}{\mathcal{U}}\) \(\newcommand{\cV}{\mathcal{V}}\) \(\newcommand{\cW}{\mathcal{W}}\) \(\newcommand{\cX}{\mathcal{X}}\) \(\newcommand{\cY}{\mathcal{Y}}\) \(\newcommand{\cZ}{\mathcal{Z}}\) \(\newcommand{\bA}{\mathbf{A}}\) \(\newcommand{\bB}{\mathbf{B}}\) \(\newcommand{\bC}{\mathbf{C}}\) \(\newcommand{\bD}{\mathbf{D}}\) \(\newcommand{\bE}{\mathbf{E}}\) \(\newcommand{\bF}{\mathbf{F}}\) \(\newcommand{\bG}{\mathbf{G}}\) \(\newcommand{\bH}{\mathbf{H}}\) \(\newcommand{\bI}{\mathbf{I}}\) \(\newcommand{\bJ}{\mathbf{J}}\) \(\newcommand{\bK}{\mathbf{K}}\) \(\newcommand{\bL}{\mathbf{L}}\) \(\newcommand{\bM}{\mathbf{M}}\) \(\newcommand{\bN}{\mathbf{N}}\) \(\newcommand{\bO}{\mathbf{O}}\) \(\newcommand{\bP}{\mathbf{P}}\) \(\newcommand{\bQ}{\mathbf{Q}}\) \(\newcommand{\bR}{\mathbf{R}}\) \(\newcommand{\bS}{\mathbf{S}}\) \(\newcommand{\bT}{\mathbf{T}}\) \(\newcommand{\bU}{\mathbf{U}}\) \(\newcommand{\bV}{\mathbf{V}}\) \(\newcommand{\bW}{\mathbf{W}}\) \(\newcommand{\bX}{\mathbf{X}}\) \(\newcommand{\bY}{\mathbf{Y}}\) \(\newcommand{\bZ}{\mathbf{Z}}\) \(\newcommand{\ba}{\mathbf{a}}\) \(\newcommand{\bb}{\mathbf{b}}\) \(\newcommand{\bc}{\mathbf{c}}\) \(\newcommand{\bd}{\mathbf{d}}\) \(\newcommand{\be}{\mathbf{e}}\) \(\newcommand{\bg}{\mathbf{g}}\) \(\newcommand{\bh}{\mathbf{h}}\) \(\newcommand{\bi}{\mathbf{i}}\) \(\newcommand{\bj}{\mathbf{j}}\) \(\newcommand{\bk}{\mathbf{k}}\) \(\newcommand{\bl}{\mathbf{l}}\) \(\newcommand{\bm}{\mathbf{m}}\) \(\newcommand{\bn}{\mathbf{n}}\) \(\newcommand{\bo}{\mathbf{o}}\) \(\newcommand{\bp}{\mathbf{p}}\) \(\newcommand{\bq}{\mathbf{q}}\) \(\newcommand{\br}{\mathbf{r}}\) \(\newcommand{\bs}{\mathbf{s}}\) \(\newcommand{\bt}{\mathbf{t}}\) \(\newcommand{\bu}{\mathbf{u}}\) \(\newcommand{\bv}{\mathbf{v}}\) \(\newcommand{\bw}{\mathbf{w}}\) \(\newcommand{\bx}{\mathbf{x}}\) \(\newcommand{\by}{\mathbf{y}}\) \(\newcommand{\bz}{\mathbf{z}}\) \(\newcommand{\RR}{\mathbb{R}}\) \(\newcommand{\NN}{\mathbb{N}}\) \(\newcommand{\balpha}{\boldsymbol{\alpha}}\) \(\newcommand{\bbeta}{\boldsymbol{\beta}}\) \(\newcommand{\btheta}{\boldsymbol{\theta}}\) \(\newcommand{\hpi}{\widehat{\pi}}\) \(\newcommand{\bpi}{\boldsymbol{\pi}}\) \(\newcommand{\hbpi}{\widehat{\boldsymbol{\pi}}}\) \(\newcommand{\bxi}{\boldsymbol{\xi}}\) \(\newcommand{\bmu}{\boldsymbol{\mu}}\) \(\newcommand{\bepsilon}{\boldsymbol{\epsilon}}\) \(\newcommand{\bzero}{\mathbf{0}}\) \(\newcommand{\T}{\text{T}}\) \(\newcommand{\Trace}{\text{Trace}}\) \(\newcommand{\Cov}{\text{Cov}}\) \(\newcommand{\Var}{\text{Var}}\) \(\newcommand{\E}{\mathbb{E}}\) \(\newcommand{\Pr}{\text{Pr}}\) \(\newcommand{\pr}{\text{pr}}\) \(\newcommand{\pdf}{\text{pdf}}\) \(\newcommand{\P}{\text{P}}\) \(\newcommand{\p}{\text{p}}\) \(\newcommand{\One}{\mathbf{1}}\) \(\newcommand{\argmin}{\operatorname*{arg\,min}}\) 
\(\newcommand{\argmax}{\operatorname*{arg\,max}}\) \(\newcommand{\dtheta}{\frac{\partial}{\partial\theta} }\) \(\newcommand{\ptheta}{\nabla_\theta}\) \(\newcommand{\alert}[1]{\color{darkorange}{#1}}\) \(\newcommand{\alertr}[1]{\color{red}{#1}}\) \(\newcommand{\alertb}[1]{\color{blue}{#1}}\)

1 Approximating a Function

In statistical modeling, we often face the problem of approximating a function \(f\) based on a set of observations. The goal is to construct a function \(\hat f\) (using data) that closely resembles \(f\) in some sense. For example, suppose the true model is a linear model

\[ f(\bx) = \bbeta^\top \bx \]

where \(\bbeta\) is the true parameter. We have learned to use the least squares estimator \(\hat \bbeta = (\bX^\T \bX)^{-1} \bX^\T \by\) to estimate \(\bbeta\) from the data, and we know that under suitable conditions this \(\hat \bbeta\) has very nice properties, such as being unbiased and having the smallest variance among all linear unbiased estimators.
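As a quick numerical check, here is a minimal sketch that computes this estimator directly and compares it with lm(). The sample size, true coefficients, and noise level below are made up for illustration only, and the variable names are suffixed with 0 so they do not clash with objects used later in this note.

  # a minimal sketch: compute the least squares estimator by hand and
  # compare it with lm(); the sample size, true beta, and noise level
  # are made-up values for illustration only
  set.seed(1)
  n0 <- 100
  X0 <- cbind(1, rnorm(n0), rnorm(n0))   # design matrix with an intercept column
  beta0 <- c(0.5, 1, -2)                 # hypothetical "true" parameter
  y0 <- drop(X0 %*% beta0 + rnorm(n0, sd = 0.3))
  
  # least squares by hand, compared with lm()
  beta_hand <- solve(t(X0) %*% X0) %*% t(X0) %*% y0
  round(cbind(manual = drop(beta_hand), lm = coef(lm(y0 ~ X0[, -1]))), 4)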

But oftentimes we may be dealing with a more complex function \(f\) that is not linear. To approximate such (unknown) functions, there are several strategies we can use, such as

  • Using higher order polynomials or nonlinear transformations of the input variables, such as \(x^2\), \(\log(x)\), \(\sin(x)\), \(I(x > 3)\), etc. Broadly speaking, this is called basis expansion.
  • Using local averages, such as \(k\)-Nearest Neighbors (KNN) or kernel smoothing; this also includes adaptive kernels such as random forests. A minimal hand-written sketch of the local-averaging idea is given right after this list.
  • Neural networks, which can be viewed as a more complex form of basis expansion with many layers and nonlinear transformations.
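Here is the minimal local-averaging sketch promised above: a hand-written \(k\)-nearest-neighbor regression on simulated data. The sample size, the choice \(k = 10\), and the noise level are arbitrary; this is only meant to illustrate the idea, not the methods used later in this note.

  # a minimal sketch of the local-averaging idea: a hand-written
  # k-nearest-neighbor regression (sample size, k, and noise level are arbitrary)
  set.seed(1)
  x_knn <- runif(100, -pi, pi)
  y_knn <- sin(x_knn) + rnorm(100, sd = 0.2)
  
  knn_predict <- function(xnew, x, y, k = 10) {
    # for each new point, average the y's of its k closest training points
    sapply(xnew, function(x0) mean(y[order(abs(x - x0))[1:k]]))
  }
  
  x_grid <- seq(-pi, pi, length.out = 200)
  plot(x_knn, y_knn, pch = 19, col = "blue", xlab = "x", ylab = "y", main = "kNN Local Average")
  lines(x_grid, knn_predict(x_grid, x_knn, y_knn, k = 10), col = "red", lwd = 2)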

When we consider what type of method to use, the following aspects are important to consider:

  • Flexibility: How flexible is the method in capturing the true function? For example, choosing a second-order polynomial may not be flexible enough to capture a complex function.
  • High-dimensionality: Can the method handle high-dimensional data? For example, high-order polynomials may add too much burden to the model when the dimensionality is high, leading to overfitting.
  • Computational cost: Are the parameters easy to estimate? How much computational resources are required to fit the model? For example, neural networks require more computational resources than linear regression.
  • Theoretical guarantees: Is it easy to analyze the properties of the estimator? For example, linear regression has well-established theoretical properties because we know the analytic solution, while neural networks may not have such guarantees.

Considering all these aspects, one of the most popular methods is the Reproducing Kernel Hilbert Space (RKHS) method, a special case of the basis expansion approach that enjoys many nice properties. We will introduce the RKHS method and how it can be used to approximate a function.

2 Example: Polynomial Basis

It is helpful to first understand how a function-approximation method works in practice. We will use an example to demonstrate how to numerically fit a model and predict new observations. Let's consider a simple example in which we want to approximate the (unknown) true function \(f(x) = \sin(x)\) using a polynomial basis expansion. We will write our own code as much as possible rather than relying on existing packages.

  # generate some data points
  set.seed(123)

  n = 50
  x <- seq(-pi, pi, length.out = n)
  y <- 0.5 + sin(x) + rnorm(length(x), sd = 0.2)
  
  # create a data frame for the observations
  train_data <- data.frame(x = x, y = y)
  
  # plot the data points
  plot(train_data$x, train_data$y, pch = 19, col = "blue", xlab = "x", ylab = "y", main = "Data Points")

A typical statistical modeling procedure would follow these steps:

  • Specify a model form, such as a polynomial basis expansion of degree \(d\)
  • Specify an empirical loss function, such as the squared error loss, based on the observations
  • Optimize the loss function on the observed data to find the best parameter estimates
  • Predict the outcome at new (testing) points using the estimated functional form

If we consider a polynomial basis expansion of degree \(d\), the function can be approximated as

\[ f(x) = \sum_{j=0}^d \beta_j x^j \]

where \(\beta_j\)’s are the coefficients to be estimated. The loss function based on a set of observations \(\{x_i, y_i\}_{i=1}^n\) can be written as

\[\begin{align} L(\bbeta) &= \frac{1}{n} \sum_{i=1}^n \left(y_i - f(x_i) \right)^2 \\ &= \frac{1}{n} \sum_{i=1}^n \left( y_i - \sum_{j=0}^d \beta_j x_i^j \right)^2 \end{align}\]
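To make this loss concrete, it can be written as a short R function; poly_loss below is an illustrative helper (the name is ours, not part of any package), which evaluates the degree-\(d\) polynomial at each observed \(x_i\) and averages the squared residuals.

  # illustrative helper: empirical squared-error loss of a degree-d
  # polynomial with coefficients beta = (beta_0, ..., beta_d)
  poly_loss <- function(beta, x, y) {
    d <- length(beta) - 1
    fx <- sapply(x, function(xi) sum(beta * xi^(0:d)))
    mean((y - fx)^2)
  }
  
  # example: loss of the (arbitrary) guess f(x) = 0.5 + x on the data above
  poly_loss(c(0.5, 1), train_data$x, train_data$y)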

Let's consider a polynomial basis expansion of degree \(d = 3\) for the function \(f\). To calculate the previous loss function on the observed data, we simply need to compute the powers of \(x_i\) up to the third power and construct the corresponding design matrix for this polynomial regression. The columns of this design matrix are:

  • \(x_i^0\) is a column of 1 — the intercept
  • \(x_i^1\) is the original \(x_i\) — the linear term
  • \(x_i^2\) is the square of \(x_i\) — the quadratic term
  • \(x_i^3\) is the cube of \(x_i\) — the cubic term
  # the new design matrix
  X = cbind(1, train_data$x, train_data$x^2, train_data$x^3)
  beta_hat = solve(t(X) %*% X) %*% t(X) %*% train_data$y
  cat("Estimated coefficients:\n")
## Estimated coefficients:
  beta_hat
##              [,1]
## [1,]  0.485874303
## [2,]  0.865520516
## [3,]  0.006134783
## [4,] -0.094449751
  
  # fitted values
  yfit <- X %*% beta_hat
  
  plot(train_data$x, train_data$y, pch = 19, col = "blue", xlab = "x", ylab = "y", main = "Data Points")
  lines(train_data$x, yfit, col = "red")
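As a sanity check, the hand-computed coefficients should agree with what lm() returns when the same raw polynomial terms are supplied (the object name lm3.fit below is just for illustration):

  # the same fit via lm() with raw polynomial terms;
  # the coefficients should match beta_hat above
  lm3.fit <- lm(y ~ x + I(x^2) + I(x^3), data = train_data)
  coef(lm3.fit)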

But of course, this would not work if the true functional form is slightly more complicated. Let's see a simple example on the \([-2\pi, 2\pi]\) range. We can see that the polynomial basis expansion of degree 3 is not flexible enough to capture the true function; it takes at least a 5th-degree polynomial to obtain a reasonable fit. We use the poly() function within lm() to implement this.

  # generate some data points
  set.seed(123)
  x <- seq(-2*pi, 2*pi, length.out = n)
  y <- 0.5 + sin(x) + rnorm(length(x), sd = 0.2)
  
  # create a data frame for the observations
  train_data <- data.frame(x = x, y = y)
  
  # the testing data
  x_test = sort(seq(-2*pi, 2*pi, length.out = 200))
  test_data = data.frame(x = x_test)
  
  # fit the polynomial regression model of degree 3
  poly3.fit <- lm(y ~ poly(x, degree = 3), data = train_data)
  y_pred <- predict(poly3.fit, newdata = test_data)

  # plot the data points and the fitted line
  par(mfrow = c(1, 2), mar = c(4, 4, 2, 1))  
  plot(train_data$x, train_data$y, pch = 19, col = "blue", xlab = "x", ylab = "y", main = "Data Points")
  lines(x_test, y_pred, col = "red")
  
  # use 5th degree
  poly5.fit <- lm(y ~ poly(x, degree = 5), data = train_data)
  y_pred <- predict(poly5.fit, newdata = test_data)
  plot(train_data$x, train_data$y, pch = 19, col = "blue", xlab = "x", ylab = "y", main = "Data Points")
  lines(x_test, y_pred, col = "red")

3 Example: Adaptive Kernel Basis

The polynomial basis expansion can be viewed as a set of global basis functions (another example would be Fourier basis functions), but we could also consider basis functions defined locally. A straightforward idea is the B-spline model with adaptively chosen knots; you may refer to this note for an implementation. Alternatively, let's consider constructing a basis using kernel functions (rather than polynomials as in the B-spline), which is more closely related to the RKHS methods in our later lectures. Be aware, however, that this use of kernel functions is very different from the local-averaging idea behind a kernel smoother, such as the Nadaraya-Watson kernel regression. To be concrete, we are interested in defining a functional basis using a Gaussian density function. Since our goal is to minimize the empirical loss on the observed data, the loss function looks like

\[ \text{Loss} = \frac{1}{n} \sum_{i=1}^n \left( y_i - \beta_0 - \sum_{j=1}^4 \beta_j K(x_i, \xi_j) \right)^2 \]

where the \(\beta_j\)'s are coefficients to be estimated and we use the Gaussian kernel \(K(x, \xi) = \exp\left( -(x-\xi)^2 / \sigma^2 \right)\). The normalizing constant of the Gaussian density does not matter here since the linear regression step will automatically adjust the scale. Now, to implement this numerically, we will

  • Define a set of knots \(\xi_j\)’s, which are the points where the Gaussian basis functions will be centered around
  • Calculate the basis functions evaluated at the observed data points \(x_i\)’s, which forms the new covariates
  • Perform regular linear regression based on these newly constructed covariates
  • For any new testing points, first construct the basis functions on these points, and use the fitted linear regression to calculate the prediction

For practice, let's use four knots:

  # generate some data points
  set.seed(123)
  x <- seq(-pi, pi, length.out = n)
  y <- 0.5 + sin(x) + rnorm(length(x), sd = 0.2)
  
  # create a data frame for the observations
  train_data <- data.frame(x = x, y = y)
  
  # find four knots based on quantiles
  nknots = 4
  knots <- quantile(train_data$x, probs = (1:nknots)/(nknots + 1)) # 4 knots
  
  # define the basis functions
  basis <- function(x, knots) {
    sapply(knots, function(knot) dnorm(x, mean = knot, sd = 0.5))
  }
  
  # construct the design matrix
  design_matrix <- matrix(NA, nrow = n, ncol = nknots+1)
  design_matrix[, 1] <- 1 # intercept
  for (i in 1:nknots) {
    design_matrix[, i+1] <- basis(train_data$x, knots[i])
  }
  
  # plot the data
  par(mfrow = c(1, 2), mar = c(4, 4, 2, 1))  
  plot(train_data$x, train_data$y, pch = 19, col = "blue", xlab = "x", ylab = "y", main = "Data Points")
  
  # plot the basis function
  plot(train_data$x, train_data$y, pch = NA, col = "blue", xlab = "x", ylab = "y", main = "Kernel Basis")
  lines(train_data$x, design_matrix[, 2], col = "red", lwd = 2)
  lines(train_data$x, design_matrix[, 3], col = "green", lwd = 2)
  lines(train_data$x, design_matrix[, 4], col = "purple", lwd = 2)
  lines(train_data$x, design_matrix[, 5], col = "orange", lwd = 2)

Fitting this linear regression gives a reasonably good result:

  # the estimated beta
  beta_hat = solve(t(design_matrix) %*% design_matrix) %*% t(design_matrix) %*% train_data$y

  # define testing data
  ntest = 200
  x_test = sort(seq(-pi, pi, length.out = ntest))
  
  # construct the testing data design matrix
  test_design_matrix <- matrix(NA, nrow = ntest, ncol = nknots+1)
  test_design_matrix[, 1] <- 1 # intercept
  for (i in 1:nknots) {
    test_design_matrix[, i+1] <- basis(x_test, knots[i])
  }
  
  # fitted y values
  y_pred <- test_design_matrix %*% beta_hat

  # plot results
  plot(train_data$x, train_data$y, pch = 19, col = "blue", xlab = "x", ylab = "y", main = "Fitted Values")
  lines(x_test, beta_hat[2]*test_design_matrix[, 2] + beta_hat[1], col = "red", lwd = 2)
  lines(x_test, beta_hat[3]*test_design_matrix[, 3] + beta_hat[1], col = "green", lwd = 2)
  lines(x_test, beta_hat[4]*test_design_matrix[, 4] + beta_hat[1], col = "purple", lwd = 2)
  lines(x_test, beta_hat[5]*test_design_matrix[, 5] + beta_hat[1], col = "orange", lwd = 2)
  lines(x_test, y_pred, lwd = 4)

This result may not be very surprising, since a spline model (e.g., a cubic spline) can achieve similar performance. However, things can quickly get complicated when the dimension of the input variable increases. Thin plate splines and tensor product splines can still be valid choices, but when the input variable is high-dimensional, it becomes difficult to define a reasonable set of basis functions that capture the true functional form. In this case, kernel basis functions are easy to implement and flexible enough in most situations; a small sketch of this idea with a two-dimensional input is given below.
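To make the last point concrete, here is a minimal sketch (not part of the examples above; the simulated two-dimensional function, bandwidth, and number of knots are all arbitrary choices) of how the same kernel-basis construction extends to a multi-dimensional input. The Gaussian kernel only requires the Euclidean distance between an observation and a knot, so the construction is essentially unchanged.

  # a minimal sketch: Gaussian kernel basis with a 2-dimensional input
  # (true function, bandwidth, and number of knots are arbitrary)
  set.seed(123)
  n2 <- 200
  X2 <- cbind(runif(n2, -2, 2), runif(n2, -2, 2))
  y2 <- sin(X2[, 1]) * cos(X2[, 2]) + rnorm(n2, sd = 0.1)
  
  # use a random subset of the observations as knots
  knots2 <- X2[sample(n2, 20), ]
  
  # each basis function depends only on the Euclidean distance to its knot
  kernel_basis_2d <- function(X, knots, sigma = 1) {
    apply(knots, 1, function(k) exp(-rowSums(sweep(X, 2, k)^2) / sigma^2))
  }
  
  design2 <- cbind(1, kernel_basis_2d(X2, knots2))
  beta2 <- solve(t(design2) %*% design2) %*% t(design2) %*% y2
  mean((y2 - design2 %*% beta2)^2)   # in-sample mean squared error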

4 Example: Increasing the Number of Basis Functions

As the functional form gets more complicated, we can increase the number of knots to obtain a better fit. There are several questions we could ask while performing this task:

  • What kernel function to use
  • How many knots and where should they be located
  • How to prevent overfitting when the number of knots increases

Let’s consider a kernel function in the following form

\[ K(x, \xi) = \frac{\sqrt{3}}{3} \exp\left( - \sqrt{3} |x - \xi| / 2 \right) \sin \left( |x - \xi| / 2 + \frac{\pi}{6} \right) \]

Note: there is an interesting connection showing that a second-order Sobolev space (see footnote 1 here) in 1-D is equivalent to an RKHS (see Berlinet and Thomas-Agnan 2011, sec. 6.1.6 and a statement here). The connection is subtle and varies depending on what domain and norm we use for the Sobolev space. The details are not crucial and are omitted here, but we will use the Sobolev space again in the spline lecture, which is an example of the representer theorem and also motivates how the number of knots is chosen.
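Before fitting, it can be helpful to look at the shape of this kernel. A short sketch plotting \(K(x, 0)\), i.e., the kernel centered at a knot \(\xi = 0\):

  # visualize the kernel above, centered at xi = 0
  xg <- seq(-6, 6, length.out = 400)
  Kx <- (sqrt(3)/3) * exp(-sqrt(3) * abs(xg) / 2) * sin(abs(xg) / 2 + pi/6)
  plot(xg, Kx, type = "l", lwd = 2, xlab = "x", ylab = "K(x, 0)",
       main = "Kernel Centered at 0")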

  # generate some data points
  set.seed(123)
  n = 100
  x <- seq(-2*pi, 2*pi, length.out = n)
  y <- 0.5 + sin(x) + rnorm(length(x), sd = 0.2)
  
  # create a data frame for the observations
  train_data <- data.frame(x = x, y = y)
  
  # find 40 knots based on quantiles
  nknots = 40
  knots <- quantile(train_data$x, probs = (1:nknots)/(nknots + 1))
  
  # define the basis functions
  basis <- function(x, knots) {
    sapply(knots, function(knot) {
      (sqrt(3)/3) * exp(-sqrt(3) * abs(x - knot) / 2) * 
      sin(abs(x - knot) / 2 + pi/6)
    })
  }
  
  # construct the design matrix
  design_matrix <- matrix(NA, nrow = n, ncol = nknots+1)
  design_matrix[, 1] <- 1 # intercept
  for (i in 1:nknots) {
    design_matrix[, i+1] <- basis(train_data$x, knots[i])
  }

  # the least squares fit
  beta_hat = solve(t(design_matrix) %*% design_matrix) %*% t(design_matrix) %*% train_data$y

  # define testing data
  ntest = 200
  x_test = sort(seq(-2*pi, 2*pi, length.out = ntest))
  
  # construct the testing data design matrix
  test_design_matrix <- matrix(NA, nrow = ntest, ncol = nknots+1)
  test_design_matrix[, 1] <- 1 # intercept
  for (i in 1:nknots) {
    test_design_matrix[, i+1] <- basis(x_test, knots[i])
  }
  
  # fitted y values
  y_pred <- test_design_matrix %*% beta_hat

  # plot the data
  plot(train_data$x, train_data$y, pch = 19, col = "blue", 
       xlim = c(-2*pi, 2*pi),
       xlab = "x", ylab = "y", main = "Fitted Values with 40 Knots")
  lines(x_test, y_pred, lwd = 4)

In this final step, let's think about what happens when the number of knots increases, possibly becoming even greater than the number of observations. In that case the design matrix has more columns than rows, so \(\bX^\T \bX\) is singular and we cannot use ordinary linear regression to estimate the coefficients. One way to deal with this is a regularization method, such as ridge regression or the LASSO, which penalizes the size of the coefficients and prevents overfitting. Ridge regression is more suitable here, since it naturally handles the multicollinearity issue when the number of knots is large.

  # generate some data points
  set.seed(123)
  n = 100
  x <- seq(-2*pi, 2*pi, length.out = n)
  y <- 0.5 + sin(x) + rnorm(length(x), sd = 0.2)
  
  # create a data frame for the observations
  train_data <- data.frame(x = x, y = y)
  
  # define knots on all observed x values
  nknots = length(train_data$x)
  knots <- sort(train_data$x)
  
  # define the basis functions
  basis <- function(x, knots) {
    sapply(knots, function(knot) {
      (sqrt(3)/3) * exp(-sqrt(3) * abs(x - knot) / 2) * 
      sin(abs(x - knot) / 2 + pi/6)
    })
  }
  
  # construct the design matrix
  design_matrix <- matrix(NA, nrow = n, ncol = nknots+1)
  design_matrix[, 1] <- 1 # intercept
  for (i in 1:nknots) {
    design_matrix[, i+1] <- basis(train_data$x, knots[i])
  }

  # the ridge regression fit
  lambda = 0.01
  beta_hat = solve(t(design_matrix) %*% design_matrix + lambda*diag(nknots+1)) %*% t(design_matrix) %*% train_data$y

  # define testing data
  ntest = 200
  x_test = sort(seq(-2*pi, 2*pi, length.out = ntest))
  
  # construct the testing data design matrix
  test_design_matrix <- matrix(NA, nrow = ntest, ncol = nknots+1)
  test_design_matrix[, 1] <- 1 # intercept
  for (i in 1:nknots) {
    test_design_matrix[, i+1] <- basis(x_test, knots[i])
  }
  
  # fitted y values
  y_pred <- test_design_matrix %*% beta_hat

  # plot the data
  plot(train_data$x, train_data$y, pch = 19, col = "blue", 
       xlim = c(-2*pi, 2*pi),
       xlab = "x", ylab = "y", main = "Regularized Fitted Values")
  lines(x_test, y_pred, lwd = 4)
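The penalty \(\lambda = 0.01\) above was fixed arbitrarily. In practice we would tune it, for example by \(k\)-fold cross-validation. The rough sketch below (the grid of \(\lambda\) values and the number of folds are arbitrary choices, and it reuses design_matrix and train_data from the code above) illustrates one way to do so.

  # a rough sketch: choose lambda by 5-fold cross-validation,
  # reusing design_matrix and train_data from the code above
  lambdas <- 10^seq(-4, 2, length.out = 20)
  folds <- sample(rep(1:5, length.out = n))
  
  cv_error <- sapply(lambdas, function(lam) {
    mean(sapply(1:5, function(k) {
      tr  <- folds != k
      Xtr <- design_matrix[tr, , drop = FALSE]
      Xte <- design_matrix[!tr, , drop = FALSE]
      b   <- solve(t(Xtr) %*% Xtr + lam * diag(ncol(Xtr))) %*% t(Xtr) %*% train_data$y[tr]
      mean((train_data$y[!tr] - Xte %*% b)^2)   # held-out mean squared error
    }))
  })
  
  lambdas[which.min(cv_error)]   # the selected penalty level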


References:

Berlinet, Alain, and Christine Thomas-Agnan. 2011. Reproducing Kernel Hilbert Spaces in Probability and Statistics. Springer Science & Business Media.