Chapter 16 Kernel Ridge Regression
With our understanding of the RKHS and the representer theorem, we can now see that for many regression models, if we want a more flexible solution, we may solve the problem within an RKHS. For example, consider the following regression problem:
\[\widehat f = \underset{f \in {\cal H}}{\arg\min} \,\, \frac{1}{n} \sum_{i=1}^n \Big(y_i - f(x_i) \Big)^2 + \lambda \lVert f \rVert_{\cal H}^2.\] Since we know that the solution has to take the form
\[\widehat f = \sum_{i=1}^n \alpha_i K(x_i, \cdot),\] we can instead solve a ridge-regression-type problem in the coefficients \(\boldsymbol{\alpha}\):
\[\widehat{\boldsymbol{\alpha}} = \underset{\boldsymbol{\alpha} \in \mathbb{R}^n}{\arg\min} \,\, \frac{1}{n} \big\lVert \mathbf{y}- \mathbf{K}\boldsymbol{\alpha}\big\rVert^2 + \lambda \lVert f \rVert_{\cal H}^2,\] where \(\mathbf{K}\) is the \(n \times n\) kernel matrix with \(K(x_i, x_j)\) as its \((i,j)\)th element. With some simple calculation, we also have
\[\begin{align} \lVert f \rVert_{\cal H}^2 =& \langle f, f \rangle \nonumber \\ =& \langle \sum_{i=1}^n \alpha_i K(x_i, \cdot), \sum_{j=1}^n \alpha_j K(x_j, \cdot) \rangle \nonumber \\ =& \sum_{i, j} \alpha_i \alpha_j \big\langle K(x_i, \cdot), K(x_j, \cdot) \big\rangle \nonumber \\ =& \sum_{i, j} \alpha_i \alpha_j K(x_i, x_j) \nonumber \\ =& \boldsymbol{\alpha}^\text{T}\mathbf{K}\boldsymbol{\alpha} \end{align}\]
Hence, the problem becomes
\[\widehat{\boldsymbol{\alpha}} = \underset{\boldsymbol{\alpha}}{\arg\min} \,\, \frac{1}{n} \big\lVert \mathbf{y}- \mathbf{K}\boldsymbol{\alpha}\big\rVert^2 + \lambda \boldsymbol{\alpha}^\text{T}\mathbf{K}\boldsymbol{\alpha}.\] Taking the derivative with respect to \(\boldsymbol{\alpha}\) and noting that \(\mathbf{K}\) is symmetric, we have
\[\begin{align} -\frac{1}{n} \mathbf{K}^\text{T}(\mathbf{y}- \mathbf{K}\boldsymbol{\alpha}) + \lambda \mathbf{K}\boldsymbol{\alpha}\overset{\text{set}}{=} \mathbf{0} \nonumber \\ \mathbf{K}(- \mathbf{y}+ \mathbf{K}\boldsymbol{\alpha}+ n\lambda \boldsymbol{\alpha}) = \mathbf{0}. \end{align}\] This implies
\[ \boldsymbol{\alpha}= (\mathbf{K}+ n\lambda \mathbf{I})^{-1} \mathbf{y},\] where the inverse exists for any \(\lambda > 0\) since \(\mathbf{K}\) is positive semi-definite, and we have obtained the solution.
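To make the solution concrete, the following is a minimal sketch in Python (not part of the derivation itself): it assumes a Gaussian (RBF) kernel and simulated data, forms \(\mathbf{K}\), solves \((\mathbf{K}+ n\lambda \mathbf{I})\boldsymbol{\alpha}= \mathbf{y}\), and predicts with \(\widehat f(x) = \sum_i \alpha_i K(x_i, x)\). The kernel choice, bandwidth, and function names here are illustrative assumptions.

```python
import numpy as np

# A minimal kernel ridge regression sketch based on the solution above:
# alpha = (K + n*lambda*I)^{-1} y, and f(x) = sum_i alpha_i K(x_i, x).
# The Gaussian (RBF) kernel and the simulated data are illustrative choices.

def rbf_kernel(A, B, sigma=1.0):
    # pairwise squared distances between rows of A and rows of B
    d2 = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-d2 / (2 * sigma**2))

def krr_fit(X, y, lam, sigma=1.0):
    n = X.shape[0]
    K = rbf_kernel(X, X, sigma)
    # solve (K + n*lam*I) alpha = y instead of forming the inverse explicitly
    alpha = np.linalg.solve(K + n * lam * np.eye(n), y)
    return alpha

def krr_predict(X_train, alpha, X_new, sigma=1.0):
    # f(x) = sum_i alpha_i K(x_i, x)
    return rbf_kernel(X_new, X_train, sigma) @ alpha

# simulated example
rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(100, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.2, size=100)

alpha = krr_fit(X, y, lam=0.01, sigma=1.0)
X_test = np.linspace(-3, 3, 5).reshape(-1, 1)
print(krr_predict(X, alpha, X_test))  # fitted curve roughly follows sin(x)
```

Solving the linear system directly is preferable to forming the inverse, since \(\mathbf{K}+ n\lambda \mathbf{I}\) is symmetric positive definite for \(\lambda > 0\).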
16.1 Example: Linear Kernel and Ridge Regression
When \(K(\mathbf{x}_i, \mathbf{x}_j) = \mathbf{x}_i^\text{T}\mathbf{x}_j\), we have \(\mathbf{K}= \mathbf{X}\mathbf{X}^\text{T}\). We should expect this to match the original ridge regression, since a linear kernel is essentially a linear regression. Plugging this into our previous result, we have
\[ \boldsymbol{\alpha}= (\mathbf{X}\mathbf{X}^\text{T}+ n\lambda \mathbf{I})^{-1} \mathbf{y},\] and the fitted values are
\[ \widehat{\mathbf{y}} = \mathbf{K}\boldsymbol{\alpha}= \mathbf{X}\mathbf{X}^\text{T}(\mathbf{X}\mathbf{X}^\text{T}+ n\lambda \mathbf{I})^{-1} \mathbf{y}.\] Using the matrix identity \((\mathbf{P}\mathbf{Q}+ c\mathbf{I})^{-1}\mathbf{P}= \mathbf{P}(\mathbf{Q}\mathbf{P}+ c\mathbf{I})^{-1}\) with \(\mathbf{P}= \mathbf{X}^\text{T}\), \(\mathbf{Q}= \mathbf{X}\), and \(c = n\lambda\), we have
\[ \widehat{\mathbf{y}} = \mathbf{K}\boldsymbol{\alpha}= \mathbf{X}(\mathbf{X}^\text{T}\mathbf{X}+ n\lambda \mathbf{I})^{-1} \mathbf{X}^\text{T}\mathbf{y}\] and
\[ \widehat{\mathbf{y}} = \mathbf{X}\underbrace{\big[ \mathbf{X}^\text{T}\boldsymbol{\alpha}\big]}_{\boldsymbol{\beta}} = \mathbf{X}\underbrace{\big[ (\mathbf{X}^\text{T}\mathbf{X}+ n\lambda \mathbf{I})^{-1} \mathbf{X}^\text{T}\mathbf{y}\big]}_{\boldsymbol{\beta}}\]
which is simply the ridge regression solution, with the corresponding coefficient vector \(\widehat{\boldsymbol{\beta}} = \mathbf{X}^\text{T}\widehat{\boldsymbol{\alpha}}\). This also makes the penalty term \(\boldsymbol{\alpha}^\text{T}\mathbf{K}\boldsymbol{\alpha}= \boldsymbol{\alpha}^\text{T}\mathbf{X}\mathbf{X}^\text{T}\boldsymbol{\alpha}= \boldsymbol{\beta}^\text{T}\boldsymbol{\beta}\), which maps everything back to the ridge regression form.
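As a quick numerical check (a sketch on simulated data; all names are illustrative), the linear-kernel solution \(\boldsymbol{\alpha}= (\mathbf{X}\mathbf{X}^\text{T}+ n\lambda \mathbf{I})^{-1}\mathbf{y}\) with \(\boldsymbol{\beta}= \mathbf{X}^\text{T}\boldsymbol{\alpha}\) should coincide with the ridge solution \((\mathbf{X}^\text{T}\mathbf{X}+ n\lambda \mathbf{I})^{-1}\mathbf{X}^\text{T}\mathbf{y}\):

```python
import numpy as np

# Numerical check (on simulated data) that linear-kernel KRR reproduces
# ridge regression: beta = X^T alpha with alpha = (X X^T + n*lam*I)^{-1} y
# should equal (X^T X + n*lam*I)^{-1} X^T y.

rng = np.random.default_rng(42)
n, p, lam = 50, 5, 0.1
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(scale=0.5, size=n)

K = X @ X.T                                          # linear kernel matrix
alpha = np.linalg.solve(K + n * lam * np.eye(n), y)  # KRR coefficients
beta_krr = X.T @ alpha                               # beta = X^T alpha
beta_ridge = np.linalg.solve(X.T @ X + n * lam * np.eye(p), X.T @ y)

print(np.allclose(beta_krr, beta_ridge))             # True
print(np.allclose(K @ alpha, X @ beta_ridge))        # fitted values also match
```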
16.2 Example: Alternative View
This example is motivated by an alternative derivation provided by Prof. Max Welling in his kernel ridge regression lecture notes. The argument parallels the SVM primal-to-dual derivation, but is carried out on a linear regression. We can then switch to the kernel version through the kernel trick.
Consider a linear regression
\[\underset{\boldsymbol{\beta}}{\text{minimize}} \,\, \frac{1}{n} \lVert \mathbf{y}- \mathbf{X}\boldsymbol{\beta}\rVert^2 + \lambda \lVert \boldsymbol{\beta}\rVert^2\]
Introduce a new set of variables
\[z_i = y_i - \mathbf{x}_i^\text{T}\boldsymbol{\beta},\] for \(i = 1, \ldots, n\). After dividing the objective by \(2\lambda\), the original problem becomes
\[\begin{align} \underset{\boldsymbol{\beta}, \mathbf{z}}{\text{minimize}} \quad & \frac{1}{2n\lambda} \lVert \mathbf{z}\rVert^2 + \frac{1}{2} \lVert \boldsymbol{\beta}\rVert^2 \nonumber \\ \text{subj. to} \quad & z_i = y_i - \mathbf{x}_i^\text{T}\boldsymbol{\beta}, \,\, i = 1, \ldots, n. \end{align}\]
If we use the same strategy from the SVM derivation, we have the Lagrangian
\[{\cal L} = \frac{1}{2n\lambda} \lVert \mathbf{z}\rVert^2 + \frac{1}{2} \lVert \boldsymbol{\beta}\rVert^2 + \sum_{i=1}^n \alpha_i (y_i - \mathbf{x}_i^\text{T}\boldsymbol{\beta}- z_i)\] with \(\alpha_i \in \mathbb{R}\). To switch from the primal to the dual, we take derivatives with respect to \(\boldsymbol{\beta}\) and \(\mathbf{z}\):
\[\begin{align} \frac{\partial \cal L}{\partial z_i} =&\, \frac{1}{n\lambda}z_i - \alpha_i = 0, \quad \text{for} \,\, i = 1, \ldots, n, \nonumber \\ \text{and}\,\, \frac{\partial \cal L}{\partial \boldsymbol{\beta}} =&\, \boldsymbol{\beta}- \sum_{i=1}^n \alpha_i \mathbf{x}_i = \mathbf{0} \end{align}\]
Hence, the estimated \(\widehat{\boldsymbol{\beta}}\) is \(\sum_{i=1}^n \alpha_i \mathbf{x}_i\), which matches our previous understanding from the representer theorem. Also, if we view this as a linear kernel solution, the predicted value at \(\mathbf{x}\) is
\[\begin{align} f(\mathbf{x}) =& \,\, \mathbf{x}^\text{T}\boldsymbol{\beta}\nonumber \\ =& \sum_{i=1}^n \alpha_i \mathbf{x}^\text{T}\mathbf{x}_i \nonumber \\ =& \sum_{i=1}^n \alpha_i K(\mathbf{x}, \mathbf{x}_i). \end{align}\]
Now, to complete the dual solution, we plug in these results (using \(z_i = n\lambda \alpha_i\) and \(\boldsymbol{\beta}= \sum_i \alpha_i \mathbf{x}_i\)) and obtain
\[\begin{align} \underset{\boldsymbol{\alpha}}{\max} \, \underset{\mathbf{z}, \boldsymbol{\beta}}{\min} \, {\cal L} =& \frac{n\lambda}{2} \boldsymbol{\alpha}^\text{T}\boldsymbol{\alpha}+ \frac{1}{2} \sum_{i, j} \alpha_i \alpha_j \mathbf{x}_i^\text{T}\mathbf{x}_j + \sum_{j} \alpha_j \big(y_j - \mathbf{x}_j^\text{T}\sum_i \alpha_i \mathbf{x}_i - n\lambda \alpha_j \big) \nonumber \\ =& - \frac{n\lambda}{2} \boldsymbol{\alpha}^\text{T}\boldsymbol{\alpha}- \frac{1}{2} \sum_{i, j} \alpha_i \alpha_j \mathbf{x}_i^\text{T}\mathbf{x}_j + \sum_{i} \alpha_i y_i \nonumber \\ =& - \frac{n\lambda}{2} \boldsymbol{\alpha}^\text{T}\boldsymbol{\alpha}- \frac{1}{2} \boldsymbol{\alpha}^\text{T}\mathbf{K}\boldsymbol{\alpha}+ \boldsymbol{\alpha}^\text{T}\mathbf{y} \end{align}\]
By again taking the derivative, now with respect to \(\boldsymbol{\alpha}\), we have
\[ - n\lambda \mathbf{I}\boldsymbol{\alpha}- \mathbf{K}\boldsymbol{\alpha}+ \mathbf{y}= \mathbf{0},\] and the solution is the same as what we had before
\[\boldsymbol{\alpha}= (\mathbf{K}+ n\lambda \mathbf{I})^{-1} \mathbf{y}\]
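As a sanity check of the dual derivation (again a sketch on simulated data; names are illustrative), we can solve for \(\boldsymbol{\alpha}\), recover \(\boldsymbol{\beta}= \mathbf{X}^\text{T}\boldsymbol{\alpha}\) and \(\mathbf{z}= n\lambda \boldsymbol{\alpha}\), verify the constraint \(z_i = y_i - \mathbf{x}_i^\text{T}\boldsymbol{\beta}\), and confirm that the primal and dual objective values coincide:

```python
import numpy as np

# Sanity check of the dual derivation on simulated data:
# alpha = (K + n*lam*I)^{-1} y, beta = X^T alpha, z = n*lam*alpha,
# then the primal and dual objective values should match (zero duality gap).

rng = np.random.default_rng(0)
n, p, lam = 40, 3, 0.2
X = rng.normal(size=(n, p))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.3, size=n)

K = X @ X.T
alpha = np.linalg.solve(K + n * lam * np.eye(n), y)   # dual solution
beta = X.T @ alpha                                    # beta = sum_i alpha_i x_i
z = n * lam * alpha                                   # from dL/dz_i = 0

primal = np.sum(z**2) / (2 * n * lam) + 0.5 * np.sum(beta**2)
dual = -0.5 * n * lam * alpha @ alpha - 0.5 * alpha @ K @ alpha + alpha @ y

print(np.allclose(y - X @ beta, z))   # constraint z_i = y_i - x_i^T beta holds
print(np.allclose(primal, dual))      # strong duality: objectives match
```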