In statistical modeling, we often face the problem of approximating a function \(f\) based on a set of observations. The goal is to define a function \(\hat f\) (using data) that closely resembles \(f\) in some sense. For example, suppose the true model is a linear model
\[ f(\bx) = \bbeta^\top \bx \]
where \(\bbeta\) is the true parameter. We have learned to use the linear regression estimator \(\hat \bbeta = (\bX^\top \bX)^{-1} \bX^\top \by\) to estimate \(\bbeta\) from the data, and we know that under suitable conditions this \(\hat \bbeta\) has very nice properties, such as being unbiased and having the smallest variance among all linear unbiased estimators.
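As a quick refresher, here is a minimal sketch checking this closed-form estimator on simulated data; the sample size, true coefficients, and noise level below are illustrative assumptions, not part of the example that follows.
# simulate a small linear model and check the closed-form OLS estimator
set.seed(1)
n_sim <- 200
X_sim <- cbind(1, rnorm(n_sim), rnorm(n_sim))   # design matrix with an intercept column
beta_true <- c(0.5, 1, -2)                      # assumed true coefficients
y_sim <- X_sim %*% beta_true + rnorm(n_sim, sd = 0.3)
beta_ols <- solve(t(X_sim) %*% X_sim) %*% t(X_sim) %*% y_sim
cbind(beta_true, beta_ols)                      # estimates should be close to the truth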
But oftentimes, we may be dealing with a more complex function \(f\) that is not linear. To deal with such (unknown) functions, there are some strategies that we can use to approximate \(f\), such as
When deciding what type of method to use, the following aspects are important to consider:
Considering all these aspects, one of the most popular methods is the Reproducing Kernel Hilbert Space (RKHS) method, which is a very special case of the basis expansion method that enjoys many nice properties. We will introduce the RKHS method and how it can be used to approximate a function.
It could be helpful to first understand how a function approximation method works in practice. We will use this as an example to demonstrate how to numerically fit the model and predict new observations. Let’s consider a simple example in which we want to approximate the (unknown true) function \(f(x) = \sin(x)\) using a polynomial basis expansion. We will try to do this with our own code as much as possible rather than using existing packages.
# generate some data points
set.seed(123)
n = 50
x <- seq(-pi, pi, length.out = n)
y <- 0.5 + sin(x) + rnorm(length(x), sd = 0.2)
# create a data frame for the observations
# including intercept
train_data <- data.frame(x = x, y = y)
# plot the data points
plot(train_data$x, train_data$y, pch = 19, col = "blue", xlab = "x", ylab = "y", main = "Data Points")
A typical statistical modeling task would follow these steps:
If we consider a polynomial basis expansion of degree \(d\), the function can be approximated as
\[ f(x) = \sum_{j=0}^d \beta_j x^j \]
where \(\beta_j\)’s are the coefficients to be estimated. The loss function based on a set of observations \(\{x_i, y_i\}_{i=1}^n\) can be written as
\[\begin{align} L(\bbeta) &= \frac{1}{n} \sum_{i=1}^n \left(y_i - f(x_i) \right)^2 \\ &= \frac{1}{n} \sum_{i=1}^n \left( y_i - \sum_{j=0}^d \beta_j x_i^j \right)^2 \end{align}\]
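As a side note, this loss could also be minimized numerically with a generic optimizer; here is a minimal sketch using optim() (the degree and starting values are illustrative choices), which should agree with the closed-form least-squares solution computed below.
# minimize the empirical loss directly with a generic optimizer
d <- 3
poly_loss <- function(beta, x, y) {
  fhat <- sapply(x, function(xi) sum(beta * xi^(0:d)))   # f(x) = sum_j beta_j x^j
  mean((y - fhat)^2)                                     # empirical squared-error loss
}
opt <- optim(rep(0, d + 1), poly_loss, x = train_data$x, y = train_data$y,
             method = "BFGS")
opt$par   # should be very close to the least-squares solution below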
Let’s consider a polynomial basis expansion of degree \(d = 3\) for the function \(f\). Hence, to calculate the previous loss function on the observed data, we simply need to compute the powers of \(x_i\) up to the third power and construct the corresponding design matrix of this polynomial regression. The columns of this design matrix would include the intercept column of ones, \(x_i\), \(x_i^2\), and \(x_i^3\).
# the new design matrix
X = cbind(1, train_data$x, train_data$x^2, train_data$x^3)
beta_hat = solve(t(X) %*% X) %*% t(X) %*% train_data$y
cat("Estimated coefficients:\n")
## Estimated coefficients:
beta_hat
## [,1]
## [1,] 0.485874303
## [2,] 0.865520516
## [3,] 0.006134783
## [4,] -0.094449751
# fitted values
yfit <- X %*% beta_hat
plot(train_data$x, train_data$y, pch = 19, col = "blue", xlab = "x", ylab = "y", main = "Data Points")
lines(train_data$x, yfit, col = "red")
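As a sanity check, the same coefficients can be recovered with lm() using raw (non-orthogonal) polynomial terms; a small sketch on the same training data.
# the same degree-3 fit through lm(), using raw polynomial terms
lm3 <- lm(y ~ poly(x, degree = 3, raw = TRUE), data = train_data)
coef(lm3)   # should match beta_hat from the normal equations above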
But of course, this would not work if the true functional form is slightly more complicated. Let’s see a simple example on the \([-2\pi, 2\pi]\) range. We can see that the polynomial basis expansion of degree 3 is not flexible enough to capture the true function; it takes at least a 5th-degree polynomial to obtain a reasonable fit. We use the poly() function within lm() to implement this.
# generate some data points
set.seed(123)
x <- seq(-2*pi, 2*pi, length.out = n)
y <- 0.5 + sin(x) + rnorm(length(x), sd = 0.2)
# create a data frame for the observations
# including intercept
train_data <- data.frame(x = x, y = y)
# the testing data
x_test = sort(seq(-2*pi, 2*pi, length.out = 200))
test_data = data.frame(x = x_test)
# fit the polynomial regression model of degree 3
poly3.fit <- lm(y ~ poly(x, degree = 3), data = train_data)
y_pred <- predict(poly3.fit, newdata = test_data)
# plot the data points and the fitted line
par(mfrow = c(1, 2), mar = c(4, 4, 2, 1))
plot(train_data$x, train_data$y, pch = 19, col = "blue", xlab = "x", ylab = "y", main = "Data Points")
lines(x_test, y_pred, col = "red")
# use 5th degree
poly5.fit <- lm(y ~ poly(x, degree = 5), data = train_data)
y_pred <- predict(poly5.fit, newdata = test_data)
plot(train_data$x, train_data$y, pch = 19, col = "blue", xlab = "x", ylab = "y", main = "Data Points")
lines(x_test, y_pred, col = "red")
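To make the degree comparison a bit more concrete, here is a quick sketch computing the training mean squared error across a range of polynomial degrees (the range 1 to 8 is an illustrative choice).
# training MSE for a range of polynomial degrees
degrees <- 1:8
mse <- sapply(degrees, function(d) {
  fit <- lm(y ~ poly(x, degree = d), data = train_data)
  mean(residuals(fit)^2)
})
round(data.frame(degree = degrees, train_mse = mse), 4)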
The polynomial basis expansion can be viewed as a set of global basis functions (another example would be Fourier basis functions), but we could also consider basis functions defined locally. A straightforward idea is the B-spline model with adaptively chosen knots; you may refer to this note for an implementation. Alternatively, let’s consider a construction of the basis using kernel functions (rather than the polynomials used in the B-spline), which is more closely related to the RKHS in our later lectures. But be aware that this idea of using kernel functions is very different from the local averaging idea of a kernel smoother, such as the Nadaraya-Watson kernel regression.

To be concrete, we are interested in defining a functional basis using a Gaussian density function. Since our goal is to minimize the empirical loss function on the data that we observe, the loss function looks like
\[ L(\bbeta) = \frac{1}{n} \sum_{i=1}^n \left( y_i - \beta_0 - \sum_{j=1}^4 \beta_j K(x_i, \xi_j) \right)^2 \]
where we use the Gaussian kernel \(K(x, \xi) = \exp\left( -(x-\xi)^2 / \sigma^2 \right)\) and the \(\xi_j\)’s are the knots. The normalizing constant of the Gaussian density does not matter here since the regression coefficients will automatically absorb the scale. To implement this numerically, we will evaluate the kernel basis at each knot to build the design matrix and then fit a linear regression on those basis columns.
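In the code below the kernel is evaluated with dnorm(), which differs from \(\exp\left(-(x-\xi)^2/\sigma^2\right)\) only by a constant factor (with \(\sigma^2 = 2 s^2\) when the standard deviation is \(s\)); a quick check, where the grid and bandwidth are illustrative assumptions.
# dnorm(x, mean = xi, sd = s) is proportional to exp(-(x - xi)^2 / (2 * s^2)),
# so using the density only rescales the basis, which the regression absorbs
xi <- 0
s  <- 0.5
x_grid <- seq(-2, 2, by = 0.5)
dnorm(x_grid, mean = xi, sd = s) / exp(-(x_grid - xi)^2 / (2 * s^2))   # constant ratio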
For practice, let’s use four knots:
# generate some data points
set.seed(123)
x <- seq(-pi, pi, length.out = n)
y <- 0.5 + sin(x) + rnorm(length(x), sd = 0.2)
# create a data frame for the observations
# including intercept
train_data <- data.frame(x = x, y = y)
# find four knots based on quantiles
nknots = 4
knots <- quantile(train_data$x, probs = (1:nknots)/(nknots + 1)) # 4 knots
# define the basis functions
basis <- function(x, knots) {
sapply(knots, function(knot) dnorm(x, mean = knot, sd = 0.5))
}
# construct the design matrix
design_matrix <- matrix(NA, nrow = n, ncol = nknots+1)
design_matrix[, 1] <- 1 # intercept
for (i in 1:nknots) {
design_matrix[, i+1] <- basis(train_data$x, knots[i])
}
# plot the data
par(mfrow = c(1, 2), mar = c(4, 4, 2, 1))
plot(train_data$x, train_data$y, pch = 19, col = "blue", xlab = "x", ylab = "y", main = "Data Points")
# plot the basis function
plot(train_data$x, train_data$y, pch = NA, col = "blue", xlab = "x", ylab = "y", main = "Kernel Basis")
lines(train_data$x, design_matrix[, 2], col = "red", lwd = 2)
lines(train_data$x, design_matrix[, 3], col = "green", lwd = 2)
lines(train_data$x, design_matrix[, 4], col = "purple", lwd = 2)
lines(train_data$x, design_matrix[, 5], col = "orange", lwd = 2)
Fitting this linear regression gives a reasonably good result:
# the estimated beta
beta_hat = solve(t(design_matrix) %*% design_matrix) %*% t(design_matrix) %*% train_data$y
# define testing data
ntest = 200
x_test = sort(seq(-pi, pi, length.out = ntest))
# construct the testing data design matrix
test_design_matrix <- matrix(NA, nrow = ntest, ncol = nknots+1)
test_design_matrix[, 1] <- 1 # intercept
for (i in 1:nknots) {
test_design_matrix[, i+1] <- basis(x_test, knots[i])
}
# fitted y values
y_pred <- test_design_matrix %*% beta_hat
# plot results
plot(train_data$x, train_data$y, pch = 19, col = "blue", xlab = "x", ylab = "y", main = "Fitted Values")
lines(x_test, beta_hat[2]*test_design_matrix[, 2] + beta_hat[1], col = "red", lwd = 2)
lines(x_test, beta_hat[3]*test_design_matrix[, 3] + beta_hat[1], col = "green", lwd = 2)
lines(x_test, beta_hat[4]*test_design_matrix[, 4] + beta_hat[1], col = "purple", lwd = 2)
lines(x_test, beta_hat[5]*test_design_matrix[, 5] + beta_hat[1], col = "orange", lwd = 2)
lines(x_test, y_pred, lwd = 4)
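As a side note, the same design matrix and fit can be obtained more compactly, for example with outer() and lm(); a minimal sketch under the same setup (same knots and the same sd = 0.5 bandwidth as above).
# build the kernel design matrix in one call and fit with lm()
K_train <- outer(train_data$x, knots, function(a, b) dnorm(a, mean = b, sd = 0.5))
kernel.fit <- lm(train_data$y ~ K_train)
coef(kernel.fit)   # should match beta_hat computed above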
This result may not be very surprising since a spline model (e.g., a cubic spline) can achieve similar performance. However, things can quickly get complicated when the dimension of the input variable increases. For example, thin plate splines and tensor product splines can still be valid choices, but when the input variable is high-dimensional it becomes difficult to define a reasonable basis that captures the true functional form. In this case, kernel basis functions are easy to implement and flexible enough in most cases.
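To illustrate that last point, here is a minimal sketch of a Gaussian kernel basis with a two-dimensional input; the data-generating surface, bandwidth, and number of knots are all illustrative assumptions and not part of the running example.
# a kernel basis in 2-D: knots are observed points, basis is exp(-||x - knot||^2 / sigma^2)
set.seed(123)
n2 <- 200
X2 <- cbind(runif(n2, -2, 2), runif(n2, -2, 2))           # 2-D inputs
y2 <- sin(X2[, 1]) * cos(X2[, 2]) + rnorm(n2, sd = 0.1)   # an assumed true surface
knots2 <- X2[sample(n2, 20), , drop = FALSE]              # 20 knots picked from the data
sigma2 <- 1
K2 <- apply(knots2, 1, function(k) exp(-rowSums(sweep(X2, 2, k)^2) / sigma2))
fit2 <- lm(y2 ~ K2)
mean(residuals(fit2)^2)                                   # training error of the kernel-basis fit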
As the functional form gets more complicated, we can increase the number of knots to obtain a better fit, as we can see in the previous example. There are several questions we could ask while performing this task:
Let’s consider a kernel function in the following form
\[ K(x, \xi) = \frac{\sqrt{3}}{3} \exp\left( - \sqrt{3} |x - \xi| / 2 \right) \sin \left( |x - \xi| / 2 + \frac{\pi}{6} \right) \]
Note: there is an interesting connection showing that a second-order Sobolev space (see footnote 1 here) in 1-D is equivalent to an RKHS (see Berlinet and Thomas-Agnan 2011, sec. 6.1.6 and a statement here). The connection is quite subtle and varies depending on what domain and norm we use for the Sobolev space. The details are not crucial and are omitted here, but we will use the Sobolev space again in the spline lecture, which is an example of the representer theorem and also motivates the choice of the number of knots.
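To get a feel for this kernel, here is a quick sketch plotting \(K(x, 0)\) over a small grid (the grid range is an arbitrary choice).
# visualize the kernel centered at xi = 0
x_grid <- seq(-6, 6, length.out = 400)
K0 <- (sqrt(3)/3) * exp(-sqrt(3) * abs(x_grid) / 2) * sin(abs(x_grid) / 2 + pi/6)
plot(x_grid, K0, type = "l", lwd = 2, xlab = "x", ylab = "K(x, 0)",
     main = "Kernel centered at 0")
With this kernel as the basis, we can repeat the same fitting procedure with many more knots: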
# generate some data points
set.seed(123)
n = 100
x <- seq(-2*pi, 2*pi, length.out = n)
y <- 0.5 + sin(x) + rnorm(length(x), sd = 0.2)
# create a data frame for the observations
# including intercept
train_data <- data.frame(x = x, y = y)
# find 40 knots based on quantiles
nknots = 40
knots <- quantile(train_data$x, probs = (1:nknots)/(nknots + 1))
# define the basis functions
basis <- function(x, knots) {
sapply(knots, function(knot) {
(sqrt(3)/3) * exp(-sqrt(3) * abs(x - knot) / 2) *
sin(abs(x - knot) / 2 + pi/6)
})
}
# construct the design matrix
design_matrix <- matrix(NA, nrow = n, ncol = nknots+1)
design_matrix[, 1] <- 1 # intercept
for (i in 1:nknots) {
design_matrix[, i+1] <- basis(train_data$x, knots[i])
}
# the new design matrix
beta_hat = solve(t(design_matrix) %*% design_matrix) %*% t(design_matrix) %*% train_data$y
# define testing data
ntest = 200
x_test = sort(seq(-2*pi, 2*pi, length.out = ntest))
# construct the testing data design matrix
test_design_matrix <- matrix(NA, nrow = ntest, ncol = nknots+1)
test_design_matrix[, 1] <- 1 # intercept
for (i in 1:nknots) {
test_design_matrix[, i+1] <- basis(x_test, knots[i])
}
# fitted y values
y_pred <- test_design_matrix %*% beta_hat
# plot the data
plot(train_data$x, train_data$y, pch = 19, col = "blue",
xlim = c(-2*pi, 2*pi),
xlab = "x", ylab = "y", main = "Fitted Values with 40 Knots")
lines(x_test, y_pred, lwd = 4)
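Before moving to the final step, here is a quick sketch of how the training error changes as the number of knots grows, reusing the basis() function and training data defined above (the set of knot counts is an illustrative choice).
# training MSE as the number of knots increases
knot_grid <- c(5, 10, 20, 40)
train_mse <- sapply(knot_grid, function(k) {
  kn <- quantile(train_data$x, probs = (1:k)/(k + 1))
  Xk <- cbind(1, basis(train_data$x, kn))                 # intercept + kernel basis
  bk <- solve(t(Xk) %*% Xk) %*% t(Xk) %*% train_data$y
  mean((train_data$y - Xk %*% bk)^2)
})
data.frame(knots = knot_grid, train_mse = round(train_mse, 4))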
In this final step, let’s think about what we want to do when the number of knots increases, possibly even beyond the number of observations. In that case, \(\bX^\top \bX\) becomes singular and we can no longer use ordinary least squares to estimate the coefficients. One way to deal with this is to use a regularization method, such as ridge regression or the LASSO, which penalizes the size of the coefficients and prevents overfitting. A ridge regression is the more suitable choice here since it naturally handles the multicollinearity issue when the number of knots is large.
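Concretely, with a ridge penalty \(\lambda\), the estimator implemented in the code below is

\[ \hat\bbeta^{\text{ridge}} = \left( \bX^\top \bX + \lambda \mathbf{I} \right)^{-1} \bX^\top \by, \]

where \(\bX\) is the kernel design matrix (including the intercept column) and \(\mathbf{I}\) is the identity matrix. For simplicity, the code penalizes the intercept as well.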
# generate some data points
set.seed(123)
n = 100
x <- seq(-2*pi, 2*pi, length.out = n)
y <- 0.5 + sin(x) + rnorm(length(x), sd = 0.2)
# create a data frame for the observations
# including intercept
train_data <- data.frame(x = x, y = y)
# define knots on all observed x values
nknots = length(train_data$x)
knots <- sort(train_data$x)
# define the basis functions
basis <- function(x, knots) {
sapply(knots, function(knot) {
(sqrt(3)/3) * exp(-sqrt(3) * abs(x - knot) / 2) *
sin(abs(x - knot) / 2 + pi/6)
})
}
# construct the design matrix
design_matrix <- matrix(NA, nrow = n, ncol = nknots+1)
design_matrix[, 1] <- 1 # intercept
for (i in 1:nknots) {
design_matrix[, i+1] <- basis(train_data$x, knots[i])
}
# the ridge regression fit
lambda = 0.01
beta_hat = solve(t(design_matrix) %*% design_matrix + lambda*diag(nknots+1)) %*% t(design_matrix) %*% train_data$y
# define testing data
ntest = 200
x_test = sort(seq(-2*pi, 2*pi, length.out = ntest))
# construct the testing data design matrix
test_design_matrix <- matrix(NA, nrow = ntest, ncol = nknots+1)
test_design_matrix[, 1] <- 1 # intercept
for (i in 1:nknots) {
test_design_matrix[, i+1] <- basis(x_test, knots[i])
}
# fitted y values
y_pred <- test_design_matrix %*% beta_hat
# plot the data
plot(train_data$x, train_data$y, pch = 19, col = "blue",
xlim = c(-2*pi, 2*pi),
xlab = "x", ylab = "y", main = "Regularized Fitted Values")
lines(x_test, y_pred, lwd = 4)
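The penalty \(\lambda = 0.01\) above was fixed arbitrarily; one common way to choose it is generalized cross-validation (GCV). Below is a minimal sketch under the same setup, where the candidate grid of \(\lambda\) values is an illustrative assumption.
# choose lambda by generalized cross-validation: GCV(lambda) = n * RSS / (n - trace(H))^2
Xd <- design_matrix
lambdas <- 10^seq(-4, 2, length.out = 30)
gcv <- sapply(lambdas, function(lam) {
  H <- Xd %*% solve(t(Xd) %*% Xd + lam * diag(ncol(Xd))) %*% t(Xd)   # ridge hat matrix
  yhat <- H %*% train_data$y
  n * sum((train_data$y - yhat)^2) / (n - sum(diag(H)))^2
})
best_lambda <- lambdas[which.min(gcv)]
best_lambda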
References:
Berlinet, A., and C. Thomas-Agnan. 2011. Reproducing Kernel Hilbert Spaces in Probability and Statistics. Springer Science & Business Media.