Principal Component Analysis (PCA) is a widely used technique for dimensionality reduction and feature extraction. However, traditional PCA is limited to linear relationships in the data. Kernel Principal Component Analysis (KPCA) extends PCA to capture non-linear relationships by using kernel methods. This allows KPCA to find more complex structures in the data, making it a powerful tool.
To illustrate the effectiveness of KPCA, we will use the two-moon dataset, which consists of two interleaving half circles. This dataset is a classic example where traditional PCA fails to capture the underlying structure due to its non-linear nature.
library(ggplot2)
library(dplyr)
set.seed(546)
#generate two-moon data
n <- 200
theta <- runif(n, 0, pi)
x1 <- cbind(cos(theta), sin(theta)) + matrix(rnorm(2*n, sd=0.05), n, 2)
x2 <- cbind(1 - cos(theta), 1 - sin(theta) - 0.5) + matrix(rnorm(2*n, sd=0.05), n, 2)
data <- rbind(x1, x2) %>% as.data.frame()
colnames(data) <- c("x1", "x2")
data$label <- factor(rep(c("Class 1", "Class 2"), each = n))
# plot the data
ggplot(data, aes(x = x1, y = x2, color = label)) +
geom_point(size = 2) +
theme_minimal() +
labs(title = "Two-Moon Dataset", x = "X1", y = "X2") +
theme(legend.title = element_blank())
A standard PCA on this dataset would not be able to separate the two classes effectively; a nonlinear method is needed. We should note that many successful tools for such datasets, such as spectral clustering, also rely on a nonlinear transformation.
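As a quick sanity check, the sketch below runs linear PCA (via prcomp) on the simulated data and plots the first two principal components. Since PCA here is just a rotation of the centered two-dimensional data, the two moons remain interleaved.
# linear PCA for comparison: the two classes remain interleaved
pca_fit <- prcomp(data[, c("x1", "x2")], center = TRUE, scale. = FALSE)
pca_data <- data.frame(PC1 = pca_fit$x[, 1], PC2 = pca_fit$x[, 2], label = data$label)
ggplot(pca_data, aes(x = PC1, y = PC2, color = label)) +
geom_point(size = 2) +
theme_minimal() +
labs(title = "Linear PCA on Two-Moon Dataset", x = "Principal Component 1", y = "Principal Component 2") +
theme(legend.title = element_blank())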
Let’s first quickly review the idea of PCA. Suppose a data matrix \(X \in \mathbb{R}^{n \times p}\) is centered (i.e., each column has mean zero). PCA seeks the directions (principal components) that maximize the variance of the projected data. This is equivalent to solving the following optimization problem (for the first PC):
\[ \begin{aligned} & \max_{v \in \mathbb{R}^p, \lVert v \rVert = 1} \frac{1}{n} v^\T X^\T X v \\ =& \max_{v \in \mathbb{R}^p, \lVert v \rVert = 1} \frac{1}{n} \sum_{i=1}^n (x_i^\T v)^2 \\ =& \max_{v \in \mathbb{R}^p, \lVert v \rVert = 1} \frac{1}{n} \sum_{i=1}^n v^\T (x_i x_i^\T) v \\ =& \max_{v \in \mathbb{R}^p, \lVert v \rVert = 1} v^\T \left( \frac{1}{n} \sum_{i=1}^n x_i x_i^\T \right) v \\ =& \max_{v \in \mathbb{R}^p, \lVert v \rVert = 1} v^\T \Sigma v \end{aligned} \]
where \(\Sigma = \frac{1}{n} X^\T X = \frac{1}{n} \sum_{i=1}^n x_i x_i^\T\) is the sample covariance matrix. The solution of this problem is the eigenvector of \(\Sigma\) associated with its largest eigenvalue, i.e., \(\Sigma v = \lambda v\); subsequent principal components are the eigenvectors associated with the next largest eigenvalues.
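As a small numerical check of this fact, the loadings returned by prcomp match the eigenvectors of the sample covariance matrix (possibly up to sign flips; prcomp uses the divisor \(n-1\), which rescales the eigenvalues but not the eigenvectors):
# check: eigenvectors of the sample covariance matrix match the prcomp loadings (up to sign)
X <- scale(as.matrix(data[, c("x1", "x2")]), center = TRUE, scale = FALSE)
Sigma_hat <- crossprod(X) / nrow(X)
eigen(Sigma_hat)$vectors
prcomp(X)$rotation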
In the original PCA, we live in the Euclidean space \(\mathbb{R}^p\), where distances and inner products are defined in the standard way. However, if we want to perform nonlinear PCA, we need to first lift the data to a higher-dimensional space. In our case, let’s lift them to an infinite-dimensional space, i.e., the RKHS, so each row of \(X\) becomes a function in the RKHS. Let’s think about each concept in the PCA setting and find the corresponding concept in the RKHS setting.
Here comes a tricky part: how do we compute the covariance in the RKHS? In PCA, we have the covariance matrix \(\Sigma = \frac{1}{n} \sum_{i=1}^n x_i x_i^\T\), which is a \(p \times p\) matrix. In the RKHS, we need to define a covariance operator that maps functions to functions. This operator is defined as follows:
\[ \Sigma = \frac{1}{n} \sum_{i=1}^n f_i \otimes f_i \]
where \(f_i \otimes f_i\) is the outer product operator defined by \((f_i \otimes f_i)(g) = \langle f_i, g \rangle_{\mathcal{H}} f_i\) for any \(g \in \mathcal{H}\). The covariance operator \(\Sigma\) is a linear operator that maps functions in the RKHS to functions in the RKHS. What does this even mean? Let’s use the feature map representation to understand this idea.
Based on our previous understanding of RKHS, we can simply create a feature map \(\Phi\) that maps the data points to a higher-dimensional space where linear PCA can (somehow) be applied. For example, let’s consider a general feature map:
\[ \Phi(x) = (\phi_1(x), \phi_2(x), \ldots, \phi_m(x))^\T \]
We want to clarify that because this could be an infinite-dimensional space, we will not search for a projection direction freely in this space. Instead, we will search for a direction in the span of the data points in this feature space, i.e.,
\[ v = \sum_{i=1}^n \alpha_i \Phi(x_i) \]
Why do this? Well, it is essentially because of the representer theorem (we already know that the solution must lie in the span of the data points). So in this high-dimensional Euclidean space, say \(\cG\), we can still define the inner product and the variance in the same way as before. Following our previous discussion, the PCA problem can be formulated as:
\[ \max_{v \in \cG, \lVert v \rVert_\cG = 1} v^\T \left( \frac{1}{n} \sum_{i=1}^n \Phi(x_i) \Phi(x_i)^\T \right) v \]
where \(\frac{1}{n} \sum_{i=1}^n \Phi(x_i) \Phi(x_i)^\T\) still represents the covariance matrix in this high-dimensional space. Now we can plug in our representation of \(v\) into the above optimization problem. The objective function is
\[ \begin{aligned} & \left( \sum_{i=1}^n \alpha_i \Phi(x_i) \right)^\T \left( \frac{1}{n} \sum_{j=1}^n \Phi(x_j) \Phi(x_j)^\T \right) \left( \sum_{k=1}^n \alpha_k \Phi(x_k) \right) \\ =& \frac{1}{n} \sum_{i=1}^n \sum_{j=1}^n \sum_{k=1}^n \alpha_i \alpha_k \Phi(x_i)^\T \Phi(x_j) \Phi(x_j)^\T \Phi(x_k) \\ =& \frac{1}{n} \sum_{j=1}^n \left( \sum_{i=1}^n \alpha_i \Phi(x_i)^\T \Phi(x_j) \right) \left( \sum_{k=1}^n \alpha_k \Phi(x_j)^\T \Phi(x_k) \right) \\ =& \frac{1}{n} \sum_{j=1}^n \left( \sum_{i=1}^n \alpha_i K(x_i, x_j) \right)^2 \\ =& \frac{1}{n} \sum_{j=1}^n \left( \sum_{i=1}^n \alpha_i K_{ij} \right) \left( \sum_{k=1}^n \alpha_k K_{kj} \right) \\ =& \frac{1}{n} \sum_{i, k} \alpha_i \alpha_k \left( \sum_j K_{ij} K_{kj} \right) \\ =& \frac{1}{n} \sum_{i, k} \alpha_i \alpha_k (K K^\T)_{ik} \\ =& \frac{1}{n} \alpha^\T K^2 \alpha \end{aligned} \]
where \(K\) is the kernel matrix with \(K_{ij} = K(x_i, x_j) = \Phi(x_i)^\T \Phi(x_j)\). The constraint can be simplified as:
\[ \lVert \sum_{i=1}^n \alpha_i \Phi(x_i) \rVert_\cG^2 = \sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j K(x_i, x_j) = \alpha^\T K \alpha = 1 \]
Therefore, the optimization problem can be rewritten as:
\[ \begin{aligned} \max_{\alpha \in \mathbb{R}^n} \quad & \frac{1}{n} \alpha^\T K^2 \alpha\\ \text{ subj. to } \quad & \alpha^\T K \alpha = 1 \end{aligned} \]
This is a generalized Rayleigh quotient problem¹, and the solution is given by the eigenvectors of the matrix \(K\) corresponding to the largest eigenvalues. Or you can view this as the Lagrangian form:
\[ \mathcal{L}(\alpha, \lambda) = \frac{1}{n} \alpha^\T K^2 \alpha - \lambda (\alpha^\T K \alpha - 1) \]
Taking the derivative with respect to \(\alpha\) and setting it to zero, we get:
\[ \begin{aligned} \frac{2}{n} K^2 \alpha - 2 \lambda K \alpha = 0 \\ \Rightarrow K^2 \alpha = n \lambda K \alpha \\ \Rightarrow K \alpha = n \lambda \alpha \\ \end{aligned} \]
This suggests that we can obtain the principal components by solving the eigenvalue problem for the kernel matrix \(K\). (Strictly speaking, canceling a factor of \(K\) in the last step requires \(K\) to be invertible; components of \(\alpha\) in the null space of \(K\) do not affect the projection, so nothing is lost.) The eigenvalues \(\lambda\) indicate the amount of variance captured by each principal component, and the eigenvectors \(\alpha\) correspond to the coefficients for the directions in the feature space. Now, to predict the principal component scores for a new data point \(x\), we can compute:
\[ \text{score} = \langle \Phi(x), v \rangle_\cG = \left\langle \Phi(x), \sum_{i=1}^n \alpha_i \Phi(x_i) \right\rangle_\cG = \sum_{i=1}^n \alpha_i K(x, x_i) \]
So everything, from estimating \(\alpha\) to scoring a new point, depends on the data only through the kernel function, just as we would expect from working in the RKHS; the feature map never needs to be constructed explicitly.
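To make the kernel trick concrete, here is a small, self-contained numerical sketch. It uses some simulated points and the degree-2 polynomial kernel \(K(x, y) = (x^\T y)^2\), whose explicit feature map in two dimensions is \(\Phi(x) = (x_1^2, x_2^2, \sqrt{2}\, x_1 x_2)^\T\); the object names below are purely for this illustration. The check verifies that the objective computed with the explicit feature map agrees with \(\frac{1}{n} \alpha^\T K^2 \alpha\).
# verify: (1/n) sum_j (Phi(x_j)' v)^2 equals (1/n) alpha' K^2 alpha when v = sum_i alpha_i Phi(x_i)
set.seed(1)
Z <- matrix(rnorm(10 * 2), 10, 2)
Phi <- cbind(Z[, 1]^2, Z[, 2]^2, sqrt(2) * Z[, 1] * Z[, 2]) # explicit feature map for (x'y)^2
K_poly <- (Z %*% t(Z))^2 # kernel matrix, no feature map needed
alpha <- rnorm(10)
v <- colSums(alpha * Phi) # v = sum_i alpha_i Phi(x_i)
c(sum((Phi %*% v)^2) / 10, as.numeric(t(alpha) %*% K_poly %*% K_poly %*% alpha) / 10) # the two numbers should match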
Well, there is actually a small issue here: we need to make sure the data in the feature space is centered; otherwise PCA would find a direction that points toward the mean of the data. In the feature map view, centering means subtracting the mean feature vector:
\[ \Phi_c(x_i) = \Phi(x_i) - \frac{1}{n} \sum_{j=1}^n \Phi(x_j) = \Phi(x_i) - \mu \]
That actually leads to a centering procedure on the kernel matrix as
\[ \begin{aligned} K_c(i,j) &= \langle \phi(x_i)-\mu,\; \phi(x_j)-\mu\rangle \\ &= K(i,j) - \langle \phi(x_i), \frac{1}{n}\sum_{k=1}^n \phi(x_k) \rangle - \langle \frac{1}{n}\sum_{k=1}^n \phi(x_k), \phi(x_j) \rangle\\ &\quad + \langle \frac{1}{n}\sum_{k=1}^n \phi(x_k), \frac{1}{n}\sum_{m=1}^n \phi(x_m) \rangle \\ &= K(i,j) - \frac{1}{n}\sum_{k=1}^n K(i,k) - \frac{1}{n}\sum_{k=1}^n K(k,j) + \frac{1}{n^2}\sum_{k=1}^n\sum_{m=1}^n K(k,m) \end{aligned} \]
This in fact subtracts the mean of each row and the mean of each column, and adds back the overall mean, a procedure known as double centering. In matrix form, we can write it as
\[ \begin{aligned} K_c &= H K H \\ &= \big(I - \tfrac{1}{n}\mathbf 1 \mathbf 1^\top\big)\, K \,\big(I - \tfrac{1}{n}\mathbf 1 \mathbf 1^\top\big) \\ &= K \;-\; \tfrac{1}{n}\,\mathbf 1 \mathbf 1^\top K \;-\; \tfrac{1}{n}\,K \mathbf 1 \mathbf 1^\top \;+\; \tfrac{1}{n^2}\,\mathbf 1 \mathbf 1^\top K \mathbf 1 \mathbf 1^\top, \end{aligned} \]
Therefore, the final step of KPCA is to solve the eigenvalue problem for the centered kernel matrix \(K_c\), i.e., \(K_c \alpha = n \lambda \alpha\).
Now we can apply KPCA to the two-moon dataset. Let’s use a Gaussian kernel. We will compute the kernel matrix, center it, and then solve the eigenvalue problem to obtain the principal components.
# compute Gaussian kernel using the dist() function
dist_matrix <- as.matrix(dist(data[, c("x1", "x2")]))
sigma <- 0.2
K <- exp(-dist_matrix^2 / (2 * sigma^2))
# center the kernel matrix
n_total <- nrow(K)
H <- diag(n_total) - matrix(1, n_total, n_total) / n_total
Kc <- H %*% K %*% H
Kc <- (Kc + t(Kc)) / 2 # symmetrize to remove tiny numerical asymmetry
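# optional sanity check: H K H equals subtracting the row and column means of K
# and adding back the grand mean (the "double centering" formula derived above);
# Kc_alt is just an illustrative name for this check
Kc_alt <- K - matrix(rowMeans(K), n_total, n_total) - matrix(colMeans(K), n_total, n_total, byrow = TRUE) + mean(K)
max(abs(Kc - Kc_alt)) # should be numerically ~ 0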
# eigen decomposition
eig <- eigen(Kc)
eig_values <- eig$values
eig_vectors <- eig$vectors
pc1 <- eig_vectors[, 1]
pc2 <- eig_vectors[, 2]
# plot the first two principal components
kpca_data <- data.frame(PC1 = pc1, PC2 = pc2, label = data$label)
ggplot(kpca_data, aes(x = PC1, y = PC2, color = label)) +
geom_point(size = 2) +
theme_minimal() +
labs(title = "Kernel PCA on Two-Moon Dataset", x = "Principal Component 1", y = "Principal Component 2") +
theme(legend.title = element_blank())
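Using the score formula derived earlier, \(\sum_{i} \alpha_i K(x, x_i)\), we can also project a new point onto the kernel principal components. The sketch below uses the same Gaussian kernel, applies the centering correction to the new point’s kernel vector, and uses the unnormalized eigenvector coefficients from above; the point \((0.5, 0.25)\) is just an arbitrary example.
# project a new point onto the first two kernel PCs (illustrative)
x_new <- c(0.5, 0.25)
k_new <- exp(-colSums((t(data[, c("x1", "x2")]) - x_new)^2) / (2 * sigma^2)) # K(x_new, x_i)
k_new_c <- k_new - mean(k_new) - rowMeans(K) + mean(K) # centering correction for the new point
c(sum(eig_vectors[, 1] * k_new_c), sum(eig_vectors[, 2] * k_new_c)) # scores on PC1 and PC2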
In fact, we can also view KPCA in the RKHS without explicitly constructing the feature map. Recall that in the RKHS, each data point \(x_i\) is mapped to a function \(f_i(\cdot) = K(x_i, \cdot)\). The covariance operator in the RKHS is given by (suppose with centered \(f_i\)’s):
\[ \Sigma = \frac{1}{n} \sum_{i=1}^n f_i \otimes f_i \]
This corresponds to the covariance matrix in the feature space, but it acts as an operator that maps functions to functions. The way this operator works is that for any functions \(f, h, g \in \mathcal{H}\), we have:
\[ (f \otimes h) g = \langle h, g \rangle_\cH f \]
The PCA problem in the RKHS can be formulated as:
\[ \begin{aligned} &\max_{g \in \cH, \lVert g \rVert_\cH = 1} \langle g, \Sigma g \rangle_\cH \\ =& \max_{g \in \cH, \lVert g \rVert_\cH = 1} \langle g, \frac{1}{n} \sum_{i=1}^n f_i \otimes f_i (g) \rangle_\cH \\ =& \max_{g \in \cH, \lVert g \rVert_\cH = 1} \frac{1}{n} \sum_{i=1}^n \langle f_i, g \rangle_\cH^2 \end{aligned} \]
Since the functions \(f_i\) are centered, we have \(\frac{1}{n} \sum_{i=1}^n f_i = 0\), which means \(\frac{1}{n} \sum_{i=1}^n \langle f_i, g \rangle_\cH = 0\), i.e., the projected data has mean zero. Subtracting this (zero) squared mean from the objective therefore changes nothing, and we have
\[ \frac{1}{n} \sum_{i=1}^n \langle f_i, g \rangle_\cH^2 - \left( \frac{1}{n} \sum_{i=1}^n \langle f_i, g \rangle_\cH \right)^2 = \text{Var}(\langle f_i, g \rangle_\cH) \]
This is exactly the variance of the projected data along the direction \(g\). Now we can optimize \(g\) to maximize this variance. Using the representer theorem, we can express the direction \(g\) as a linear combination of the functions \(f_i\):
\[ g = \sum_{i=1}^n \alpha_i f_i \]
Plugging this representation into the PCA problem, we have: \[ \begin{aligned} \langle g, \Sigma g \rangle_\cH &= \frac{1}{n} \sum_{i=1}^n \langle f_i, \sum_{j=1}^n \alpha_j f_j \rangle_\cH^2 \\ &= \frac{1}{n} \sum_{i=1}^n \left( \sum_{j=1}^n \alpha_j \langle f_i, f_j \rangle_\cH \right)^2 \\ &= \frac{1}{n} \sum_{i=1}^n \left( \sum_{j=1}^n \alpha_j K(x_i, x_j) \right)^2 \\ &= \frac{1}{n} \alpha^\T K^2 \alpha \end{aligned} \]
And we get exactly the same optimization problem as before. Hence the feature map view can be skipped if we directly work in the RKHS.
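As a quick numerical illustration of the centering claim above: with the double-centered kernel matrix \(K_c\) computed earlier, the projected scores \(K_c \alpha\) have mean zero for any coefficient vector \(\alpha\) (here a random test vector).
# with the double-centered kernel, projections have mean zero for any alpha
alpha_test <- rnorm(n_total)
mean(Kc %*% alpha_test) # numerically ~ 0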
You can generalize this idea to pretty much all the unsupervised learning methods that rely on distances or inner products. All we need to do is convert the observed data into functions in the RKHS, and then find the corresponding target function, also in the RKHS, as a linear combination of the observed basis functions. For example, you can do this for K-means clustering, spectral clustering, etc.
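As one simple illustration (a sketch, not a full kernel K-means implementation), we can run ordinary K-means on the leading kernel principal components computed above. For this example it tends to recover the two moons, which K-means on the raw coordinates typically fails to do.
# a simple kernel-based clustering: K-means on the leading kernel PCs
set.seed(1)
km <- kmeans(cbind(pc1, pc2), centers = 2, nstart = 20)
table(cluster = km$cluster, truth = data$label)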
¹ You can perform a change of variables to convert it to a standard eigenvalue problem. Let \(\beta = K^{1/2} \alpha\); then the problem becomes \(\max_{\beta} \frac{1}{n} \beta^\T K \beta\) subject to \(\beta^\T \beta = 1\), whose solution is the leading eigenvector of \(K\), i.e., \(K \beta = n \lambda \beta\). Converting back via \(\alpha = K^{-1/2} \beta\), we get \(K \alpha = K^{1/2} \beta = n \lambda K^{-1/2} \beta = n \lambda \alpha\), so \(\alpha\) is also an eigenvector of \(K\).