Chapter 15 Reproducing Kernel Hilbert Space

In the previous chapter on SVM, we gave an example showing that instead of computing the inner product \(\langle \Phi(\mathbf{x}), \Phi(\mathbf{z}) \rangle\) between the feature maps of \(\mathbf{x}\) and \(\mathbf{z}\), we can use the kernel trick \(K(\mathbf{x}, \mathbf{z})\) to perform the exact same calculation. Also, in the penalized kernel version, we mentioned that the decision rule can be expressed in the finite sample form \(\sum_{i = 1}^n \beta_i K(\cdot, \mathbf{x}_i)\) by the Representer Theorem. All of these are based on a fundamental tool, the Reproducing Kernel Hilbert Space (RKHS), and this chapter provides some basic knowledge of it. We will also prove the Representer Theorem (Kimeldorf and Wahba 1970), whose proof is very similar to that of the smoothing spline.

15.1 Constructing the RKHS

Recall that in the smoothing spline example, we wanted to fit a regression model by solving for a function \(f\) in a rather complicated space, the second order Sobolev space. We could not exhaust all the candidates in this space because that would be computationally intractable. However, the result there shows that the solution has a finite representation. In general, when we solve a regression problem in the Loss \(+\) Penalty form, we enjoy the same property if we search for the (regression or decision) function \(f\) within an RKHS. So let’s first define what this space looks like.

We start with the feature space \(\cal X\) of \(X\), where \(X\) is just the \(p\)-dimensional feature vector we often deal with. Say we have a sample \(x_1\); then, given a kernel function \(K(\cdot, \cdot)\), we can construct a new function \(K(x_1, \cdot)\). Keep in mind that \(K(x_1, \cdot)\) is a function with argument \(\cdot\) and parameter \(x_1\). Similarly, we can take another sample, say \(x_2\), and generate a function based on that sample, called \(K(x_2, \cdot)\). The following plot shows three such functions, using red, orange and blue lines, respectively.

Since we can have many samples from \(\cal X\), we will also have infinitely many such functions \(K(x, \cdot)\), and their linear combinations are also of interest to us. Let’s consider a space \({\cal G}\) of all such functions

\[{\cal G} = \left\{\sum_{i=1}^n \alpha_i K(x_i, \cdot) \,\middle|\, \alpha_i \in \mathbb{R}, n \in \mathbb{N}, x_i \in {\cal X} \right\} \] The black curve in the previous plot is an example of such a linear combination. We can see that the functions within \({\cal G}\) become more and more flexible as we consider all the linear combinations. As one final step, we consider the completion of this space, which leads to the RKHS.

\[\cal H = \bar{\cal G}.\] Completion here means that \(\cal H\) will contain the limits of all Cauchy sequences of such functions in \(\cal G\).
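To make the construction concrete, here is a minimal sketch in Python. It assumes a Gaussian kernel with a hypothetical bandwidth, three hypothetical sample points, and arbitrary coefficients; these choices are illustrative and not from the chapter. It builds the functions \(K(x_i, \cdot)\) and one element of \(\cal G\) as a linear combination of them.

```python
# A minimal sketch: functions K(x_i, .) from a Gaussian kernel (assumed
# bandwidth gamma = 0.5) at three hypothetical sample points, plus one
# linear combination, i.e. one member of the space G.
import numpy as np
import matplotlib.pyplot as plt

def K(x, z, gamma=0.5):
    # Gaussian (RBF) kernel K(x, z) = exp(-gamma * (x - z)^2)
    return np.exp(-gamma * (x - z) ** 2)

grid = np.linspace(-4, 4, 401)        # the argument "." of K(x_i, .)
samples = [-2.0, 0.0, 1.5]            # hypothetical x_1, x_2, x_3
alphas = [1.0, -0.5, 0.8]             # hypothetical coefficients alpha_i

for x_i, color in zip(samples, ["red", "orange", "blue"]):
    plt.plot(grid, K(x_i, grid), color=color, label=f"K({x_i}, .)")

# one element of G: a linear combination of the three kernel functions
f = sum(a * K(x_i, grid) for a, x_i in zip(alphas, samples))
plt.plot(grid, f, color="black", label="linear combination")
plt.legend()
plt.show()
```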

15.2 Properties of RKHS

This space \(\cal H\) enjoys several important and useful properties. First, by the Riesz representation theorem, we know that \(\cal H\) is a Hilbert space with the reproducing property. For a (real-valued) Hilbert space, the inner product must satisfy the following (a small numerical check is given after the list):

  • symmetric: \(\langle K_x, K_z \rangle = \langle K_z, K_x \rangle\)
  • linear: \(\langle a K_{x_1} + b K_{x_2}, K_z \rangle = a \langle K_{x_1}, K_z \rangle + b \langle K_{x_2}, K_z \rangle\)
  • positive definite: \(\langle K_x, K_x \rangle \geq 0\) and \(\langle K_x, K_x \rangle = 0\) iff \(K_x = 0\)
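Since the inner product in \(\cal H\) is computed through the kernel (as shown in the next subsection), these properties show up in the kernel matrix evaluated on any finite sample. Here is a small numerical check, assuming a Gaussian kernel and a few random points (both hypothetical choices), that the Gram matrix \(K_{ij} = K(x_i, x_j)\) is symmetric and positive semi-definite.

```python
# A small check that a Gaussian-kernel Gram matrix is symmetric and has
# nonnegative eigenvalues, mirroring the inner product properties above.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=10)                                # hypothetical points
K = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2)      # Gram matrix K_ij

print(np.allclose(K, K.T))                    # symmetry
print(np.linalg.eigvalsh(K).min() >= -1e-10)  # positive semi-definiteness
```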

Also, the reproducing property means that when we evaluate a function \(f \in \cal H\) at a point \(x\), it is the same as calculating the inner product between \(f\) and \(K_x\). Formally,

\[f(x) = \langle f, K_x \rangle_{\cal H}\]

Now, we could simply take \(f = K_z\), which means that evaluating \(K_z\) at \(x\) gives

\[K_z(x) = \langle K_z, K_x \rangle_{\cal H}\] Note that \(K_z(x) = K(z, x)\); this implies that the inner product in \(\cal H\) is computed by the kernel:

\[\langle K_z, K_x \rangle_{\cal H} = K(z, x)\] For example, if we have \(f(\cdot) = \sum_i \alpha_i K(x_i, \cdot)\), then evaluating \(f\) at \(x\) is

\[\begin{align} f(x) =& \, \langle f, K(x, \cdot) \rangle_{\cal H} \nonumber \\ =& \, \left\langle \sum_i \alpha_i K(x_i, \cdot), K(x, \cdot) \right\rangle_{\cal H} \nonumber \\ =& \, \sum_i \alpha_i \left\langle K(x_i, \cdot), K(x, \cdot) \right\rangle_{\cal H} \nonumber \\ =& \, \sum_i \alpha_i K(x_i, x) \end{align}\]
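Here is a minimal numerical sketch of this calculation, assuming a Gaussian kernel and hypothetical points and coefficients. By bilinearity and \(\langle K_x, K_z \rangle_{\cal H} = K(x, z)\), the inner product of two functions in the span of kernel functions reduces to a double sum of kernel values, and taking the second function to be \(K(x, \cdot)\) recovers the evaluation \(f(x)\).

```python
# For f = sum_i a_i K(x_i, .) and g = sum_j b_j K(z_j, .), the RKHS inner
# product is <f, g>_H = sum_{i,j} a_i b_j K(x_i, z_j); with g = K(x, .)
# this equals f(x), i.e. the reproducing property.
import numpy as np

def K(x, z, gamma=0.5):
    return np.exp(-gamma * np.subtract.outer(x, z) ** 2)

x_i = np.array([-2.0, 0.0, 1.5])     # hypothetical x_i
a = np.array([1.0, -0.5, 0.8])       # hypothetical alpha_i

def inner(xs, cs, zs, ds):
    # <sum_i cs_i K(xs_i, .), sum_j ds_j K(zs_j, .)>_H
    return cs @ K(xs, zs) @ ds

x_new = 0.3
f_at_x = inner(x_i, a, np.array([x_new]), np.array([1.0]))  # <f, K(x, .)>_H
direct = a @ K(x_i, np.array([x_new]))                      # sum_i a_i K(x_i, x)
print(np.allclose(f_at_x, direct))   # True
```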

The Moore–Aronszajn theorem (Aronszajn 1950) ensures that a positive definite kernel \(K(\cdot, \cdot)\) on \(\cal X\) uniquely defines such an RKHS, where \(K(\cdot, \cdot)\) itself is the reproducing kernel. Hence, all we need is the original space \(\cal X\) and a kernel function; the RKHS is then defined as stated previously, with all the nice properties. Besides these, another result, by Mercer, interprets kernels as feature maps, which we have already seen in the SVM chapter: \(K(x, z)= \langle \Phi(x), \Phi(z) \rangle\). Overall, we have established relationships among three quantities in their respective spaces (a small numerical check of the feature-map view follows the list):

  • original features \(x\)
  • feature maps \(\Phi(x)\)
  • functions \(K(x, \cdot)\)
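The feature-map interpretation can be checked directly for kernels whose feature map is known in closed form. The sketch below assumes the degree-2 polynomial kernel \(K(x, z) = (x^\top z)^2\) in two dimensions, whose explicit feature map is \(\Phi(x) = (x_1^2, x_2^2, \sqrt{2}\, x_1 x_2)\); the specific points are arbitrary.

```python
# Check that K(x, z) = (x'z)^2 equals <Phi(x), Phi(z)> for the explicit
# feature map Phi(x) = (x1^2, x2^2, sqrt(2) x1 x2).
import numpy as np

def K(x, z):
    return (x @ z) ** 2

def Phi(x):
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

x = np.array([1.0, 2.0])
z = np.array([-0.5, 3.0])
print(K(x, z), Phi(x) @ Phi(z))   # both give 30.25
```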

15.3 The Representer Theorem

\(\cal H\) is still a very large space of functions, and it is not clear how we could computationally search over \(\cal H\) for the \(f\) that solves our optimization problem; it is unlikely that we can exhaust all such functions. Luckily, we don’t need to. This is ensured by the Representer Theorem, which states that only a finite sample representation is needed.

Theorem 15.1 (Representer Theorem) Suppose we are given a set of data \(\{x_i, y_i\}_{i=1}^n\) and we search for the best solution in \({\cal H}\) of the optimization problem \[\widehat f = \underset{f \in \cal H}{\arg\min} \,\, {\cal L}(\{y_i, f(x_i)\}_{i=1}^n) + p(\| f \|_{{\cal H}}^2 ),\] where \({\cal L}\) is the loss function, \(p\) is a monotonically increasing penalty function, and \({\cal H}\) is the RKHS with kernel \(K\). Then the solution must take the form \[\widehat f = \sum_{i=1}^n w_i K(\cdot, x_i).\]
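To see how the theorem is used in practice, here is a minimal sketch for one special case: squared-error loss with penalty \(\lambda \| f \|_{\cal H}^2\) (i.e. kernel ridge regression), a Gaussian kernel, and simulated data, all of which are illustrative assumptions rather than the chapter's running example. The theorem reduces the search over \(\cal H\) to solving for the \(n\) coefficients \(w\), which in this case satisfy \(w = (K + \lambda I)^{-1} y\).

```python
# Kernel ridge regression as an instance of the Representer Theorem:
# the solution f = sum_i w_i K(., x_i) requires only an n-dimensional solve.
import numpy as np

rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=30)                       # simulated inputs
y = np.sin(X) + rng.normal(scale=0.2, size=30)        # simulated responses

gamma, lam = 0.5, 0.1                                 # assumed tuning values
K = np.exp(-gamma * (X[:, None] - X[None, :]) ** 2)   # n x n Gram matrix
w = np.linalg.solve(K + lam * np.eye(len(X)), y)      # finite-dimensional solve

def f_hat(x_new):
    # evaluate f = sum_i w_i K(., x_i) at new points
    k = np.exp(-gamma * (np.atleast_1d(x_new)[:, None] - X[None, :]) ** 2)
    return k @ w

print(f_hat(np.array([0.0, 1.0])))
```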

The proof is quite simple. The logic is the same as the smoothing spline proof.

Proof. We can first use the kernel \(K\) associated with \(\cal H\) to define a set of functions \[K(\cdot, x_1), \, K(\cdot, x_2), \, \cdots, \, K(\cdot, x_n)\]

Then, suppose the solution is some function \(f \in {\cal H}\). We can project \(f\) onto the space spanned by these functions, which means that we can write \(f\) as

\[f(\cdot) = \sum_{i=1}^n \alpha_i K(\cdot, x_i) + h(\cdot)\]

for some \(\alpha_i\) and \(h(\cdot)\), where \(h(\cdot)\) lies in the orthogonal complement of the span of these \(K(\cdot, x_i)\) functions. By the reproducing property, we then have

\[ h(x_i) = \langle K(x_i, \cdot), h(\cdot) \rangle = 0\]

for all \(i\). You may recall the same construction of \(h\) in the smoothing spline proof, which also satisfies \(h(x_i) = 0\) for all \(i\). By the reproducing property, we have, for any observation \(x_j\) in the training data,

\[\begin{align} f(x_j) =& \langle f(\cdot), K(\cdot, x_j) \rangle \nonumber \\ =& \left\langle \sum_{i=1}^n \alpha_i K(x_i, \cdot) + h(\cdot), K(\cdot, x_j) \right\rangle \nonumber \\ =& \sum_{i=1}^n \alpha_i K(x_i, x_j) + h(x_j) \nonumber \\ =& \sum_{i=1}^n \alpha_i K(x_i, x_j) \end{align}\]

This means that the evaluation of \(f(x_j)\) is the same as evaluating this finite representation, hence the loss function is the same regardless of whether we include \(h\) or not. Moreover, the penalty term of the finite representation can only be smaller, since

\[\begin{align} \lVert f \rVert^2 =& \lVert \sum_{i=1}^n \alpha_i K(\cdot, x_i) + h(\cdot) \rVert^2 \nonumber \\ =& \lVert \sum_{i=1}^n \alpha_i K(\cdot, x_i) \rVert^2 + \lVert h(\cdot) \rVert^2 \nonumber \\ \geq& \lVert \sum_{i=1}^n \alpha_i K(\cdot, x_i) \rVert^2 \end{align}\]

This completes the proof, since the finite representation is preferred over any \(f\) that includes a nonzero \(h\).
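The key step of the proof can also be illustrated numerically. The sketch below assumes a Gaussian kernel and a handful of hypothetical points; it constructs an \(h\) orthogonal to the span of the \(K(\cdot, x_i)\)'s by taking the residual of projecting \(K(\cdot, z)\), for a point \(z\) outside the sample, onto that span. As in the proof, \(h\) vanishes at every training point, so it cannot change the loss, while its squared RKHS norm is nonnegative and only adds to the penalty.

```python
# Numerical illustration of the proof: h = K(., z) - projection onto
# span{K(., x_i)} satisfies h(x_j) = 0 for all training points, and
# ||h||_H^2 >= 0 only inflates the penalty term.
import numpy as np

gamma = 0.5
X = np.array([-2.0, -0.5, 1.0, 2.5])                 # training points x_i
z = 0.7                                              # a point outside the sample
Kxx = np.exp(-gamma * (X[:, None] - X[None, :]) ** 2)
kxz = np.exp(-gamma * (X - z) ** 2)

# projection coefficients c of K(., z) onto span{K(., x_i)}
c = np.linalg.solve(Kxx, kxz)

# h evaluated at training points: K(x_j, z) - sum_i c_i K(x_j, x_i) = 0
print(np.allclose(kxz - Kxx @ c, 0))                 # True

# squared RKHS norm of h: K(z, z) - 2 c'k + c'Kc, nonnegative
norm_h_sq = 1.0 - 2 * c @ kxz + c @ Kxx @ c
print(norm_h_sq >= 0)                                # True
```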

Reference

Aronszajn, Nachman. 1950. “Theory of Reproducing Kernels.” Transactions of the American Mathematical Society 68 (3): 337–404.
Kimeldorf, George S., and Grace Wahba. 1970. “A Correspondence Between Bayesian Estimation on Stochastic Processes and Smoothing by Splines.” The Annals of Mathematical Statistics 41 (2): 495–502.