Principal Component Analysis (PCA) is a widely used technique for dimensionality reduction and feature extraction. However, traditional PCA is limited to linear relationships in the data. Kernel Principal Component Analysis (KPCA) extends PCA to capture non-linear relationships by using kernel methods. This allows KPCA to find more complex structures in the data, making it a powerful tool.
To illustrate the effectiveness of KPCA, we will use the two-moon dataset, which consists of two interleaving half circles. This dataset is a classic example where traditional PCA fails to capture the underlying structure due to its non-linear nature.
library(ggplot2)
library(dplyr)
set.seed(546)
#generate two-moon data
n <- 200
theta <- runif(n, 0, pi)
x1 <- cbind(cos(theta), sin(theta)) + matrix(rnorm(2*n, sd=0.05), n, 2)
x2 <- cbind(1 - cos(theta), 1 - sin(theta) - 0.5) + matrix(rnorm(2*n, sd=0.05), n, 2)
data <- rbind(x1, x2) %>% as.data.frame()
colnames(data) <- c("x1", "x2")
data$label <- factor(rep(c("Class 1", "Class 2"), each = n))
# plot the data
ggplot(data, aes(x = x1, y = x2, color = label)) +
geom_point(size = 2) +
theme_minimal() +
labs(title = "Two-Moon Dataset", x = "X1", y = "X2") +
theme(legend.title = element_blank())
A standard PCA on this dataset would not be able to separate the two classes effectively; a nonlinear method is needed. We should note that many successful tools for such datasets, such as spectral clustering, also rely on a nonlinear transformation.
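As a quick sanity check, the sketch below runs linear PCA (via prcomp) on the simulated data and plots the first two principal components. Since PCA here is just a rotation of the centered two-dimensional data, the two moons remain interleaved.
# linear PCA for comparison: the two classes remain interleaved
pca_fit <- prcomp(data[, c("x1", "x2")], center = TRUE, scale. = FALSE)
pca_data <- data.frame(PC1 = pca_fit$x[, 1], PC2 = pca_fit$x[, 2], label = data$label)
ggplot(pca_data, aes(x = PC1, y = PC2, color = label)) +
geom_point(size = 2) +
theme_minimal() +
labs(title = "Linear PCA on Two-Moon Dataset", x = "Principal Component 1", y = "Principal Component 2") +
theme(legend.title = element_blank())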
Let’s first quickly review the idea of PCA. Suppose a data matrix \(X \in \mathbb{R}^{n \times p}\) is centered (i.e., each column has mean zero). PCA seeks the directions (principal components) that maximize the variance of the projected data. This is equivalent to solving the following optimization problem (for the first PC):
\[ \begin{aligned} & \max_{v \in \mathbb{R}^p, \lVert v \rVert = 1} \frac{1}{n} v^\T X^\T X v \\ =& \max_{v \in \mathbb{R}^p, \lVert v \rVert = 1} \frac{1}{n} \sum_{i=1}^n (x_i^\T v)^2 \\ =& \max_{v \in \mathbb{R}^p, \lVert v \rVert = 1} \frac{1}{n} \sum_{i=1}^n v^\T (x_i x_i^\T) v \\ =& \max_{v \in \mathbb{R}^p, \lVert v \rVert = 1} v^\T \left( \frac{1}{n} \sum_{i=1}^n x_i x_i^\T \right) v \\ =& \max_{v \in \mathbb{R}^p, \lVert v \rVert = 1} v^\T \Sigma v \end{aligned} \]
where \(\Sigma = \frac{1}{n} X^\T X = \frac{1}{n} \sum_{i=1}^n x_i x_i^\T\) is the sample covariance matrix. The solution of this problem is the eigenvector of \(\Sigma\) associated with its largest eigenvalue, i.e., \(\Sigma v = \lambda v\); subsequent principal components are the eigenvectors associated with the next largest eigenvalues.
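As a small numerical check of this fact, the loadings returned by prcomp match the eigenvectors of the sample covariance matrix (possibly up to sign flips; prcomp uses the divisor \(n-1\), which rescales the eigenvalues but not the eigenvectors):
# check: eigenvectors of the sample covariance matrix match the prcomp loadings (up to sign)
X <- scale(as.matrix(data[, c("x1", "x2")]), center = TRUE, scale = FALSE)
Sigma_hat <- crossprod(X) / nrow(X)
eigen(Sigma_hat)$vectors
prcomp(X)$rotation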
In the original PCA, we live in the Euclidean space \(\mathbb{R}^p\), where distances and inner products are defined in the standard way. However, if we want to perform nonlinear PCA, we need to first lift the data to a higher-dimensional space. In our case, let’s lift them to an infinite-dimensional space, i.e., the RKHS, so each row of \(X\) becomes a function in the RKHS. Let’s think about each concept in the PCA setting and find the corresponding concept in the RKHS setting.
Here comes a tricky part: how do we compute the covariance in the RKHS? In PCA, we have the covariance matrix \(\Sigma = \frac{1}{n} \sum_{i=1}^n x_i x_i^\T\), which is a \(p \times p\) matrix. In the RKHS, we need to define a covariance operator that maps functions to functions. This operator is defined as follows:
\[ \Sigma = \frac{1}{n} \sum_{i=1}^n f_i \otimes f_i \]
where \(f_i \otimes f_i\) is the outer product operator defined by \((f_i \otimes f_i)(g) = \langle f_i, g \rangle_{\mathcal{H}} f_i\) for any \(g \in \mathcal{H}\). The covariance operator \(\Sigma\) is a linear operator that maps functions in the RKHS to functions in the RKHS. What does this even mean? Let’s use the feature map representation to understand this idea.
Based on our previous understanding of RKHS, we can simply create a feature map \(\Phi\) that maps the data points to a higher-dimensional space where linear PCA can (somehow) be applied. For example, let’s consider a general feature map:
\[ \Phi(x) = (\phi_1(x), \phi_2(x), \ldots, \phi_m(x))^\T \]
We want to clarify that because this could be an infinite-dimensional space, we will not search for a projection direction freely in this space. Instead, we will search for a direction in the span of the data points in this feature space, i.e.,
\[ v = \sum_{i=1}^n \alpha_i \Phi(x_i) \]
Why do this? Well, it is essentially because of the representer theorem (we already know that the solution must lie in the span of the data points). So in this high-dimensional Euclidean space, say \(\cG\), we can still define the inner product and the variance in the same way as before. Following our previous discussion, the PCA problem can be formulated as:
\[ \max_{v \in \cG, \lVert v \rVert_\cG = 1} v^\T \left( \frac{1}{n} \sum_{i=1}^n \Phi(x_i) \Phi(x_i)^\T \right) v \]
where \(\frac{1}{n} \sum_{i=1}^n \Phi(x_i) \Phi(x_i)^\T\) still represents the covariance matrix in this high-dimensional space. Now we can plug in our representation of \(v\) into the above optimization problem. The objective function is
\[ \begin{aligned} & \left( \sum_{i=1}^n \alpha_i \Phi(x_i) \right)^\T \left( \frac{1}{n} \sum_{j=1}^n \Phi(x_j) \Phi(x_j)^\T \right) \left( \sum_{k=1}^n \alpha_k \Phi(x_k) \right) \\ =& \frac{1}{n} \sum_{i=1}^n \sum_{j=1}^n \sum_{k=1}^n \alpha_i \alpha_k \Phi(x_i)^\T \Phi(x_j) \Phi(x_j)^\T \Phi(x_k) \\ =& \frac{1}{n} \sum_{j=1}^n \left( \sum_{i=1}^n \alpha_i \Phi(x_i)^\T \Phi(x_j) \right) \left( \sum_{k=1}^n \alpha_k \Phi(x_j)^\T \Phi(x_k) \right) \\ =& \frac{1}{n} \sum_{j=1}^n \left( \sum_{i=1}^n \alpha_i K(x_i, x_j) \right)^2 \\ =& \frac{1}{n} \sum_{j=1}^n \left( \sum_{i=1}^n \alpha_i K_{ij} \right) \left( \sum_{k=1}^n \alpha_k K_{kj} \right) \\ =& \frac{1}{n} \sum_{i, k} \alpha_i \alpha_k \left( \sum_j K_{ij} K_{kj} \right) \\ =& \frac{1}{n} \sum_{i, k} \alpha_i \alpha_k (K K^\T)_{ik} \\ =& \frac{1}{n} \alpha^\T K^2 \alpha \end{aligned} \]
where \(K\) is the kernel matrix with \(K_{ij} = K(x_i, x_j) = \Phi(x_i)^\T \Phi(x_j)\). The constraint can be simplified as:
\[ \lVert \sum_{i=1}^n \alpha_i \Phi(x_i) \rVert_\cG^2 = \sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j K(x_i, x_j) = \alpha^\T K \alpha = 1 \]
Therefore, the optimization problem can be rewritten as:
\[ \begin{aligned} \max_{\alpha \in \mathbb{R}^n} \quad & \frac{1}{n} \alpha^\T K^2 \alpha\\ \text{ subj. to } \quad & \alpha^\T K \alpha = 1 \end{aligned} \]
This is a generalized Rayleigh quotient problem¹, and the solution is given by the eigenvectors of the matrix \(K\) corresponding to the largest eigenvalues. Or you can view this as the Lagrangian form:
\[ \mathcal{L}(\alpha, \lambda) = \frac{1}{n} \alpha^\T K^2 \alpha - \lambda (\alpha^\T K \alpha - 1) \]
Taking the derivative with respect to \(\alpha\) and setting it to zero, we get:
\[ \begin{aligned} \frac{2}{n} K^2 \alpha - 2 \lambda K \alpha = 0 \\ \Rightarrow K^2 \alpha = n \lambda K \alpha \\ \Rightarrow K \alpha = n \lambda \alpha \\ \end{aligned} \]
This suggests that we can obtain the principal components by solving the eigenvalue problem for the kernel matrix \(K\). (Strictly speaking, canceling a factor of \(K\) in the last step requires \(K\) to be invertible; components of \(\alpha\) in the null space of \(K\) do not affect the projection, so nothing is lost.) The eigenvalues \(\lambda\) indicate the amount of variance captured by each principal component, and the eigenvectors \(\alpha\) correspond to the coefficients for the directions in the feature space. Now, to predict the principal component scores for a new data point \(x\), we can compute:
\[ \text{score} = \langle \Phi(x), v \rangle_\cG = \left\langle \Phi(x), \sum_{i=1}^n \alpha_i \Phi(x_i) \right\rangle_\cG = \sum_{i=1}^n \alpha_i K(x, x_i) \]
So everything, from estimating \(\alpha\) to scoring a new point, depends on the data only through the kernel function, just as we would expect from working in the RKHS; the feature map never needs to be constructed explicitly.
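To make the kernel trick concrete, here is a small, self-contained numerical sketch. It uses some simulated points and the degree-2 polynomial kernel \(K(x, y) = (x^\T y)^2\), whose explicit feature map in two dimensions is \(\Phi(x) = (x_1^2, x_2^2, \sqrt{2}\, x_1 x_2)^\T\); the object names below are purely for this illustration. The check verifies that the objective computed with the explicit feature map agrees with \(\frac{1}{n} \alpha^\T K^2 \alpha\).
# verify: (1/n) sum_j (Phi(x_j)' v)^2 equals (1/n) alpha' K^2 alpha when v = sum_i alpha_i Phi(x_i)
set.seed(1)
Z <- matrix(rnorm(10 * 2), 10, 2)
Phi <- cbind(Z[, 1]^2, Z[, 2]^2, sqrt(2) * Z[, 1] * Z[, 2]) # explicit feature map for (x'y)^2
K_poly <- (Z %*% t(Z))^2 # kernel matrix, no feature map needed
alpha <- rnorm(10)
v <- colSums(alpha * Phi) # v = sum_i alpha_i Phi(x_i)
c(sum((Phi %*% v)^2) / 10, as.numeric(t(alpha) %*% K_poly %*% K_poly %*% alpha) / 10) # the two numbers should match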
Well, there is actually a small issue here: we need to make sure the data in the feature space is centered; otherwise PCA would find a direction that points toward the mean of the data. In the feature map view, centering means subtracting the mean feature vector:
\[ \Phi_c(x_i) = \Phi(x_i) - \frac{1}{n} \sum_{j=1}^n \Phi(x_j) = \Phi(x_i) - \mu \]
That actually leads to a centering procedure on the kernel matrix as
\[ \begin{aligned} K_c(i,j) &= \langle \phi(x_i)-\mu,\; \phi(x_j)-\mu\rangle \\ &= K(i,j) - \langle \phi(x_i), \frac{1}{n}\sum_{k=1}^n \phi(x_k) \rangle - \langle \frac{1}{n}\sum_{k=1}^n \phi(x_k), \phi(x_j) \rangle\\ &\quad + \langle \frac{1}{n}\sum_{k=1}^n \phi(x_k), \frac{1}{n}\sum_{m=1}^n \phi(x_m) \rangle \\ &= K(i,j) - \frac{1}{n}\sum_{k=1}^n K(i,k) - \frac{1}{n}\sum_{k=1}^n K(k,j) + \frac{1}{n^2}\sum_{k=1}^n\sum_{m=1}^n K(k,m) \end{aligned} \]
This in fact subtracts the mean of each row and the mean of each column, and adds back the overall mean, a procedure known as double centering. In matrix form, we can write it as
\[ \begin{aligned} K_c &= H K H \\ &= \big(I - \tfrac{1}{n}\mathbf 1 \mathbf 1^\top\big)\, K \,\big(I - \tfrac{1}{n}\mathbf 1 \mathbf 1^\top\big) \\ &= K \;-\; \tfrac{1}{n}\,\mathbf 1 \mathbf 1^\top K \;-\; \tfrac{1}{n}\,K \mathbf 1 \mathbf 1^\top \;+\; \tfrac{1}{n^2}\,\mathbf 1 \mathbf 1^\top K \mathbf 1 \mathbf 1^\top, \end{aligned} \]
Therefore, the final step of KPCA is to solve the eigenvalue problem for the centered kernel matrix \(K_c\), i.e., \(K_c \alpha = n \lambda \alpha\).
Now we can apply KPCA to the two-moon dataset. Let’s use a Gaussian kernel. We will compute the kernel matrix, center it, and then solve the eigenvalue problem to obtain the principal components.
# compute Gaussian kernel using the dist() function
dist_matrix <- as.matrix(dist(data[, c("x1", "x2")]))
sigma <- 0.2
K <- exp(-dist_matrix^2 / (2 * sigma^2))
# center the kernel matrix
n_total <- nrow(K)
H <- diag(n_total) - matrix(1, n_total, n_total) / n_total
Kc <- H %*% K %*% H
Kc <- (Kc + t(Kc)) / 2 # symmetrize to remove tiny numerical asymmetry
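# optional sanity check: H K H equals subtracting the row and column means of K
# and adding back the grand mean (the "double centering" formula derived above);
# Kc_alt is just an illustrative name for this check
Kc_alt <- K - matrix(rowMeans(K), n_total, n_total) - matrix(colMeans(K), n_total, n_total, byrow = TRUE) + mean(K)
max(abs(Kc - Kc_alt)) # should be numerically ~ 0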
# eigen decomposition
eig <- eigen(Kc)
eig_values <- eig$values
eig_vectors <- eig$vectors
pc1 <- eig_vectors[, 1]
pc2 <- eig_vectors[, 2]
# plot the first two principal components
kpca_data <- data.frame(PC1 = pc1, PC2 = pc2, label = data$label)
ggplot(kpca_data, aes(x = PC1, y = PC2, color = label)) +
geom_point(size = 2) +
theme_minimal() +
labs(title = "Kernel PCA on Two-Moon Dataset", x = "Principal Component 1", y = "Principal Component 2") +
theme(legend.title = element_blank())
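Using the score formula derived earlier, \(\sum_{i} \alpha_i K(x, x_i)\), we can also project a new point onto the kernel principal components. The sketch below uses the same Gaussian kernel, applies the centering correction to the new point’s kernel vector, and uses the unnormalized eigenvector coefficients from above; the point \((0.5, 0.25)\) is just an arbitrary example.
# project a new point onto the first two kernel PCs (illustrative)
x_new <- c(0.5, 0.25)
k_new <- exp(-colSums((t(data[, c("x1", "x2")]) - x_new)^2) / (2 * sigma^2)) # K(x_new, x_i)
k_new_c <- k_new - mean(k_new) - rowMeans(K) + mean(K) # centering correction for the new point
c(sum(eig_vectors[, 1] * k_new_c), sum(eig_vectors[, 2] * k_new_c)) # scores on PC1 and PC2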
In fact, we can also view KPCA in the RKHS without explicitly constructing the feature map. Recall that in the RKHS, each data point \(x_i\) is mapped to a function \(f_i(\cdot) = K(x_i, \cdot)\). The covariance operator in the RKHS is given by (suppose with centered \(f_i\)’s):
\[ \Sigma = \frac{1}{n} \sum_{i=1}^n f_i \otimes f_i \]
This corresponds to the covariance matrix in the feature space, but it acts as an operator that maps functions to functions. The way this operator works is that for any functions \(f, h, g \in \mathcal{H}\), we have:
\[ (f \otimes h) g = \langle h, g \rangle_\cH f \]
The PCA problem in the RKHS can be formulated as:
\[ \begin{aligned} &\max_{g \in \cH, \lVert g \rVert_\cH = 1} \langle g, \Sigma g \rangle_\cH \\ =& \max_{g \in \cH, \lVert g \rVert_\cH = 1} \langle g, \frac{1}{n} \sum_{i=1}^n f_i \otimes f_i (g) \rangle_\cH \\ =& \max_{g \in \cH, \lVert g \rVert_\cH = 1} \frac{1}{n} \sum_{i=1}^n \langle f_i, g \rangle_\cH^2 \end{aligned} \]
Since the functions \(f_i\) are centered, we have \(\frac{1}{n} \sum_{i=1}^n f_i = 0\), which means \(\frac{1}{n} \sum_{i=1}^n \langle f_i, g \rangle_\cH = 0\), i.e., the projected data has mean zero. Subtracting this (zero) squared mean from the objective therefore changes nothing, and we have
\[ \frac{1}{n} \sum_{i=1}^n \langle f_i, g \rangle_\cH^2 - \left( \frac{1}{n} \sum_{i=1}^n \langle f_i, g \rangle_\cH \right)^2 = \text{Var}(\langle f_i, g \rangle_\cH) \]
This is exactly the variance of the projected data along the direction \(g\). Now we can optimize \(g\) to maximize this variance. Using the representer theorem, we can express the direction \(g\) as a linear combination of the functions \(f_i\):
\[ g = \sum_{i=1}^n \alpha_i f_i \]
Plugging this representation into the PCA problem, we have: \[ \begin{aligned} \langle g, \Sigma g \rangle_\cH &= \frac{1}{n} \sum_{i=1}^n \langle f_i, \sum_{j=1}^n \alpha_j f_j \rangle_\cH^2 \\ &= \frac{1}{n} \sum_{i=1}^n \left( \sum_{j=1}^n \alpha_j \langle f_i, f_j \rangle_\cH \right)^2 \\ &= \frac{1}{n} \sum_{i=1}^n \left( \sum_{j=1}^n \alpha_j K(x_i, x_j) \right)^2 \\ &= \frac{1}{n} \alpha^\T K^2 \alpha \end{aligned} \]
And we get exactly the same optimization problem as before. Hence the feature map view can be skipped if we directly work in the RKHS.
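As a quick numerical illustration of the centering claim above: with the double-centered kernel matrix \(K_c\) computed earlier, the projected scores \(K_c \alpha\) have mean zero for any coefficient vector \(\alpha\) (here a random test vector).
# with the double-centered kernel, projections have mean zero for any alpha
alpha_test <- rnorm(n_total)
mean(Kc %*% alpha_test) # numerically ~ 0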
You can generalize this idea to pretty much all the unsupervised learning methods that rely on distances or inner products. All we need to do is convert the observed data into functions in the RKHS, and then find the corresponding target function, also in the RKHS, as a linear combination of the observed basis functions. For example, you can do this for K-means clustering, spectral clustering, etc.
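As one simple illustration (a sketch, not a full kernel K-means implementation), we can run ordinary K-means on the leading kernel principal components computed above. For this example it tends to recover the two moons, which K-means on the raw coordinates typically fails to do.
# a simple kernel-based clustering: K-means on the leading kernel PCs
set.seed(1)
km <- kmeans(cbind(pc1, pc2), centers = 2, nstart = 20)
table(cluster = km$cluster, truth = data$label)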
¹ You can perform a change of variables to convert it to a standard eigenvalue problem. Let \(\beta = K^{1/2} \alpha\); then the problem becomes \(\max_{\beta} \frac{1}{n} \beta^\T K \beta\) subject to \(\beta^\T \beta = 1\), whose solution is the leading eigenvector of \(K\), i.e., \(K \beta = n \lambda \beta\). Converting back via \(\alpha = K^{-1/2} \beta\), we get \(K \alpha = K^{1/2} \beta = n \lambda K^{-1/2} \beta = n \lambda \alpha\), so \(\alpha\) is also an eigenvector of \(K\).