Chapter 15 Kernel Ridge Regression

15.1 Linear Regression as a Constrained Optimization

This example is motivated by an alternative derivation provided by Prof. Max Welling in his kernel ridge regression lecture notes. The argument uses a primal-dual derivation, which will also be used for SVMs, and it allows us to switch to the kernel version through the kernel trick. Consider the ridge regression problem

\[ \underset{\boldsymbol{\beta}}{\text{minimize}} \,\, \frac{1}{n} \lVert \mathbf{y}- \mathbf{X}\boldsymbol{\beta}\rVert^2 + \lambda \lVert \boldsymbol{\beta}\rVert^2 \]
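For later comparison, setting the gradient of this objective with respect to \(\boldsymbol{\beta}\) to zero gives the familiar closed-form ridge solution

\[ \widehat{\boldsymbol{\beta}} = (\mathbf{X}^\text{T}\mathbf{X}+ n\lambda \mathbf{I})^{-1} \mathbf{X}^\text{T}\mathbf{y}, \]

where the factor \(n\) appears because the loss term is averaged over the \(n\) observations.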

Introduce a new set of variables

\[ z_i = y_i - \mathbf{x}_i^\text{T}\boldsymbol{\beta}, \]

for \(i = 1, \ldots, n\). Then the original problem becomes

\[\begin{align} \underset{\boldsymbol{\beta}, \mathbf{z}}{\text{minimize}} \quad & \frac{1}{2n\lambda} \lVert \mathbf{z}\rVert^2 + \frac{1}{2} \lVert \boldsymbol{\beta}\rVert^2 \nonumber \\ \text{subj. to} \quad & z_i = y_i - \mathbf{x}_i^\text{T}\boldsymbol{\beta}, \,\, i = 1, \ldots, n. \end{align}\]
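To see that this is equivalent to the original problem, substitute the constraint back into the objective: it is exactly the original objective multiplied by the constant \(\frac{1}{2\lambda}\), which does not change the minimizer,

\[ \frac{1}{2n\lambda} \lVert \mathbf{z}\rVert^2 + \frac{1}{2} \lVert \boldsymbol{\beta}\rVert^2 = \frac{1}{2\lambda} \Big( \frac{1}{n} \lVert \mathbf{y}- \mathbf{X}\boldsymbol{\beta}\rVert^2 + \lambda \lVert \boldsymbol{\beta}\rVert^2 \Big). \]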

To solve this constrained optimization problem, we utilize the Lagrangian

\[ {\cal L}(\boldsymbol{\alpha}; \mathbf{z}, \boldsymbol{\beta}) = \frac{1}{2n\lambda} \lVert \mathbf{z}\rVert^2 + \frac{1}{2} \lVert \boldsymbol{\beta}\rVert^2 + \sum_{i=1}^n \alpha_i (y_i - \mathbf{x}_i^\text{T}\boldsymbol{\beta}- z_i) \]

with \(\alpha_i \in \mathbb{R}\), and our problem becomes

\[ \underset{\mathbf{z}, \boldsymbol{\beta}}{\min} \underset{\boldsymbol{\alpha}}{\max} {\cal L}(\boldsymbol{\alpha}; \mathbf{z}, \boldsymbol{\beta}) \]

The logic behind the Lagrangian is that if we first maximize with respect to \(\boldsymbol{\alpha}\), any \((\mathbf{z}, \boldsymbol{\beta})\) that violates a constraint drives the Lagrangian to positive infinity, so the outer minimization with respect to \(\mathbf{z}\) and \(\boldsymbol{\beta}\) would never choose such values (this is made concrete below). The min-max problem above is called the primal form, and by swapping the two optimizations, we get the dual form

\[ \underset{\boldsymbol{\alpha}}{\max} \underset{\mathbf{z}, \boldsymbol{\beta}}{\min} {\cal L}(\boldsymbol{\alpha}; \mathbf{z}, \boldsymbol{\beta}) \]
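To make the earlier point concrete, maximizing the Lagrangian over \(\boldsymbol{\alpha}\) for a fixed pair \((\mathbf{z}, \boldsymbol{\beta})\) gives

\[ \underset{\boldsymbol{\alpha}}{\max} \, {\cal L}(\boldsymbol{\alpha}; \mathbf{z}, \boldsymbol{\beta}) = \begin{cases} \frac{1}{2n\lambda} \lVert \mathbf{z}\rVert^2 + \frac{1}{2} \lVert \boldsymbol{\beta}\rVert^2 & \text{if } z_i = y_i - \mathbf{x}_i^\text{T}\boldsymbol{\beta} \text{ for all } i, \\ +\infty & \text{otherwise}, \end{cases} \]

so the outer minimization in the primal form only ever considers values of \(\mathbf{z}\) and \(\boldsymbol{\beta}\) that satisfy the constraints.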

Note that the dual form is always a lower bound of the primal form. Under some regularity conditions, which hold for this convex problem with affine constraints, the two are equal. To proceed with the optimization, we first minimize with respect to \(\boldsymbol{\beta}\) and \(\mathbf{z}\) by taking the derivatives and setting them to zero.

\[\begin{align} \frac{\partial \cal L}{\partial z_i} =&\, \frac{1}{n\lambda}z_i - \alpha_i = 0, \quad \text{for} \,\, i = 1, \ldots, n, \nonumber \\ \text{and}\,\, \frac{\partial {\cal L}}{\partial \boldsymbol{\beta}} =&\, \boldsymbol{\beta}- \sum_{i=1}^n \alpha_i \mathbf{x}_i = \mathbf{0} \end{align}\]

Hence, the stationarity conditions give \(z_i = n\lambda \alpha_i\) and \(\widehat{\boldsymbol{\beta}} = \sum_{i=1}^n \alpha_i \mathbf{x}_i\), which matches our previous understanding. Also, if we view the inner product as a linear kernel \(K(\mathbf{x}_1, \mathbf{x}_2) = \mathbf{x}_1^\text{T}\mathbf{x}_2\), the predicted value at \(\mathbf{x}\) is

\[\begin{align} f(\mathbf{x}) =& \,\, \mathbf{x}^\text{T}\boldsymbol{\beta}\nonumber \\ =& \sum_{i=1}^n \alpha_i \mathbf{x}^\text{T}\mathbf{x}_i \nonumber \\ =& \sum_{i=1}^n \alpha_i \langle\mathbf{x}, \mathbf{x}_i \rangle \nonumber \\ =& \sum_{i=1}^n \alpha_i K(\mathbf{x}, \mathbf{x}_i). \end{align}\]

Now, to complete the dual solution, we plug in these results and obtain

\[\begin{align} \underset{\boldsymbol{\alpha}}{\max} \underset{\mathbf{z}, \boldsymbol{\beta}}{\min} {\cal L} =& \frac{n\lambda}{2} \boldsymbol{\alpha}^\text{T}\boldsymbol{\alpha}+ \frac{1}{2} \sum_{i, j} \alpha_i \alpha_j \mathbf{x}_i^\text{T}\mathbf{x}_j + \sum_{j} \alpha_j \big(y_j - \mathbf{x}_j^\text{T}\sum_i \alpha_i \mathbf{x}_i - n\lambda \alpha_j \big) \nonumber \\ =& - \frac{n\lambda}{2} \boldsymbol{\alpha}^\text{T}\boldsymbol{\alpha}- \frac{1}{2} \sum_{i, j} \alpha_i \alpha_j K(\mathbf{x}_i, \mathbf{x}_j) + \sum_{i} \alpha_i y_i \nonumber \\ =& - \frac{n\lambda}{2} \boldsymbol{\alpha}^\text{T}\boldsymbol{\alpha}- \frac{1}{2} \boldsymbol{\alpha}^\text{T}\mathbf{K}\boldsymbol{\alpha}+ \boldsymbol{\alpha}^\text{T}\mathbf{y} \end{align}\]

By again taking the derivative with respect to \(\boldsymbol{\alpha}\) and setting it to zero, we have

\[ - n\lambda \mathbf{I}\boldsymbol{\alpha}- \mathbf{K}\boldsymbol{\alpha}+ \mathbf{y}= \mathbf{0},\] and the solution is the same as what we had before

\[\boldsymbol{\alpha}= (\mathbf{K}+ n\lambda \mathbf{I})^{-1} \mathbf{y}\]
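As a quick numerical check, the following is a minimal sketch in numpy, using simulated data and arbitrary choices of \(n\), \(p\), and \(\lambda\), verifying that the dual solution with a linear kernel reproduces the primal ridge estimate.

```python
import numpy as np

np.random.seed(1)
n, p, lam = 100, 5, 0.3          # sample size, dimension, lambda (arbitrary choices)

# simulated data for illustration only
X = np.random.randn(n, p)
y = X @ np.random.randn(p) + 0.5 * np.random.randn(n)

# primal ridge solution: beta = (X'X + n*lambda*I)^{-1} X'y
beta_primal = np.linalg.solve(X.T @ X + n * lam * np.eye(p), X.T @ y)

# dual solution with the linear kernel K = X X': alpha = (K + n*lambda*I)^{-1} y
K = X @ X.T
alpha = np.linalg.solve(K + n * lam * np.eye(n), y)

# recover beta from the dual representation: beta = sum_i alpha_i x_i = X' alpha
beta_dual = X.T @ alpha

print(np.allclose(beta_primal, beta_dual))   # expect True
```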

15.2 The Kernel Ridge Regression

With our understanding of the RKHS and the representer theorem (introduced in the next chapter), we can say that for any such regression model, if we want the solution to be more flexible, we may solve it within an RKHS. This essentially replaces the linear kernel used previously with any other kernel function. In general, we consider the following regression problem:

\[\widehat f = \underset{f \in {\cal H}}{\arg\min} \,\, \frac{1}{n} \sum_{i=1}^n \Big(y_i - f(x_i) \Big)^2 + \lambda \lVert f \rVert_{\cal H}^2\] Since we know that the solution has to take the form

\[\widehat f = \sum_{i=1}^n \alpha_i K(x_i, \cdot),\] we can instead solve the problem as a ridge regression type of problem:

\[\widehat f = \underset{f \in {\cal H}}{\arg\min} \,\, \frac{1}{n} \big\lVert \mathbf{y}- \mathbf{K}\boldsymbol{\alpha}\big\rVert^2 + \lambda \lVert f \rVert_{\cal H}^2,\] where \(\mathbf{K}\) is an \(n \times n\) matrix with \(K(x_i, x_j)\) at its \((i,j)\)th element. With some simple calculation, we also have

\[\begin{align} \lVert f \rVert_{\cal H}^2 =& \langle f, f \rangle \nonumber \\ =& \langle \sum_{i=1}^n \alpha_i K(x_i, \cdot), \sum_{j=1}^n \alpha_j K(x_j, \cdot) \rangle \nonumber \\ =& \sum_{i, j} \alpha_i \alpha_j \big\langle K(x_i, \cdot), K(x_j, \cdot) \big\rangle \nonumber \\ =& \sum_{i, j} \alpha_i \alpha_j K(x_i, x_j) \nonumber \\ =& \boldsymbol{\alpha}^\text{T}\mathbf{K}\boldsymbol{\alpha} \end{align}\]

Hence, the problem becomes

\[\widehat f = \underset{f \in {\cal H}}{\arg\min} \,\, \frac{1}{n} \big\lVert \mathbf{y}- \mathbf{K}\boldsymbol{\alpha}\big\rVert^2 + \lambda \boldsymbol{\alpha}^\text{T}\mathbf{K}\boldsymbol{\alpha}.\] By taking the derivative with respect to \(\boldsymbol{\alpha}\), we have (note that \(\mathbf{K}\) is symmetric),

\[\begin{align} -\frac{1}{n} \mathbf{K}^\text{T}(\mathbf{y}- \mathbf{K}\boldsymbol{\alpha}) + \lambda \mathbf{K}\boldsymbol{\alpha}\overset{\text{set}}{=} \mathbf{0} \nonumber \\ \mathbf{K}(- \mathbf{y}+ \mathbf{K}\boldsymbol{\alpha}+ n\lambda \boldsymbol{\alpha}) = \mathbf{0}. \end{align}\] This is satisfied (and, when \(\mathbf{K}\) is invertible, uniquely so) by

\[ \boldsymbol{\alpha}= (\mathbf{K}+ n\lambda \mathbf{I})^{-1} \mathbf{y},\] and we obtain the same solution as before.
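The same formula works for any positive definite kernel. The following is a minimal numpy sketch on simulated one-dimensional data, assuming a Gaussian kernel; the bandwidth and \(\lambda\) values are arbitrary illustrative choices rather than recommendations.

```python
import numpy as np

np.random.seed(2)
n, lam, bw = 200, 0.01, 0.5      # sample size, lambda, kernel bandwidth (arbitrary choices)

# simulated one-dimensional data for illustration only
x = np.random.uniform(-3, 3, n)
y = np.sin(x) + 0.3 * np.random.randn(n)

# Gaussian kernel matrix: K_ij = exp(-(x_i - x_j)^2 / (2 * bw^2))
K = np.exp(-(x[:, None] - x[None, :]) ** 2 / (2 * bw ** 2))

# kernel ridge regression solution: alpha = (K + n*lambda*I)^{-1} y
alpha = np.linalg.solve(K + n * lam * np.eye(n), y)

# prediction at new points x0: f(x0) = sum_i alpha_i K(x0, x_i)
x0 = np.linspace(-3, 3, 100)
K0 = np.exp(-(x0[:, None] - x[None, :]) ** 2 / (2 * bw ** 2))
f_hat = K0 @ alpha
```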

15.3 Ridge Regression as a Linear Kernel Model

When \(K(\mathbf{x}_i, \mathbf{x}_j) = \mathbf{x}_i^\text{T}\mathbf{x}_j\), we also have \(\mathbf{K}= \mathbf{X}\mathbf{X}^\text{T}\). We should expect this to match the original ridge regression, since this is essentially a linear regression. Plugging this into our previous result, we have

\[ \boldsymbol{\alpha}= (\mathbf{X}\mathbf{X}^\text{T}+ n\lambda \mathbf{I})^{-1} \mathbf{y}.\] and the fitted value is

\[ \widehat{\mathbf{y}} = \mathbf{K}\boldsymbol{\alpha}= \mathbf{X}\mathbf{X}^\text{T}(\mathbf{X}\mathbf{X}^\text{T}+ n\lambda \mathbf{I})^{-1} \mathbf{y}\] Using the matrix identity \((\mathbf{P}\mathbf{Q}+ \mathbf{I})^{-1}\mathbf{P}= \mathbf{P}(\mathbf{Q}\mathbf{P}+ \mathbf{I})^{-1}\) (a quick verification is given at the end of this section) and letting \(\mathbf{Q}= \mathbf{X}= \mathbf{P}^\text{T}\), we have

\[ \widehat{\mathbf{y}} = \mathbf{K}\boldsymbol{\alpha}= \mathbf{X}(\mathbf{X}^\text{T}\mathbf{X}+ n\lambda \mathbf{I})^{-1} \mathbf{X}^\text{T}\mathbf{y}\] and

\[ \widehat{\mathbf{y}} = \mathbf{X}\underbrace{\big[ \mathbf{X}^\text{T}\boldsymbol{\alpha}\big]}_{\boldsymbol{\beta}} = \mathbf{X}\underbrace{\big[ (\mathbf{X}^\text{T}\mathbf{X}+ n\lambda \mathbf{I})^{-1} \mathbf{X}^\text{T}\mathbf{y}\big]}_{\boldsymbol{\beta}}\]

which is simply the ridge regression solution, with the corresponding linear regression coefficients recovered as \(\widehat{\boldsymbol{\beta}} = \mathbf{X}^\text{T}\widehat{\boldsymbol{\alpha}}\). This makes the penalty term \(\boldsymbol{\alpha}^\text{T}\mathbf{K}\boldsymbol{\alpha}= \boldsymbol{\alpha}^\text{T}\mathbf{X}\mathbf{X}^\text{T}\boldsymbol{\alpha}= \boldsymbol{\beta}^\text{T}\boldsymbol{\beta}\), which maps everything back to the ridge regression form. In this example, we can see that the functional form of our estimation is

\[\begin{align} \widehat f(\cdot) &= \cdot^\text{T}\widehat{\boldsymbol{\beta}} \\ &= \cdot^\text{T}\mathbf{X}^\text{T}\widehat{\boldsymbol{\alpha}} \\ &= \sum_{i=1}^n \widehat{\alpha}_i \langle \cdot, \mathbf{x}_i \rangle \\ \end{align}\]

which falls into the span of the linear kernel functions \(\langle \cdot, \mathbf{x}_i \rangle\).
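As a side note, the matrix identity used above is straightforward to verify, and it holds with any constant \(c > 0\) (here \(c = n\lambda\)) in place of the identity scaling: since

\[ \mathbf{P}(\mathbf{Q}\mathbf{P}+ c\mathbf{I}) = \mathbf{P}\mathbf{Q}\mathbf{P}+ c\mathbf{P}= (\mathbf{P}\mathbf{Q}+ c\mathbf{I})\mathbf{P}, \]

multiplying by \((\mathbf{P}\mathbf{Q}+ c\mathbf{I})^{-1}\) on the left and \((\mathbf{Q}\mathbf{P}+ c\mathbf{I})^{-1}\) on the right gives \((\mathbf{P}\mathbf{Q}+ c\mathbf{I})^{-1}\mathbf{P}= \mathbf{P}(\mathbf{Q}\mathbf{P}+ c\mathbf{I})^{-1}\).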