Suppose we are given a random variable \(X\) with distribution \(P\) on a domain \(\cX\). The distribution is a complicated object: if it has a density \(p(x)\), then that density is a function on \(\cX\), which we can think of as an infinite-dimensional vector. A natural question to ask is: if we wanted to represent this density in a finite-dimensional way, how would we do it?
There are some straightforward ideas for representing a distribution. For example, we can consider the mean (or other moments) of the distribution, defined as
\[ \mu_P = \E_P[X] = \int x p(x) dx. \]
For some distributions, the mean and variance already characterize the distribution. In general, however, the mean and variance are not enough to represent a distribution. Another straightforward idea is to discretize the domain \(\cX\) into a finite number of bins / regions, and then represent the density by the probability masses in those bins. This is essentially what a histogram does. More precisely, if we partition \(\cX\) into \(m\) disjoint regions \(A_1, \ldots, A_m\), then we can summarize the density within each region by its total probability mass, i.e.,
\[ \begin{aligned} \mu_P(A_i) &= \int 1_{A_i}(x) p(x) dx \\ &= \int_{x \in A_i} p(x) dx \\ &= P(X \in A_i) \end{aligned} \]
for each \(i = 1, \ldots, m\). The vector \(\big( \mu_P(A_1), \ldots, \mu_P(A_m) \big)^\T\) is a finite-dimensional representation of the distribution \(P\). This is of course quite restrictive, and it can be difficult to compare representations built from two different choices of bins; their dimensions may not even match. So how about defining something that is more flexible and also lives in a common space, say an RKHS?
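Before moving on, here is a minimal R sketch of this binned representation; the sample and the bin edges are chosen arbitrarily for illustration.
# a minimal sketch: empirical bin-probability representation of a distribution
set.seed(1)
x <- runif(500, min = -1, max = 1)                    # sample from P
breaks <- seq(-1, 1, by = 0.5)                        # bin edges defining A_1, ..., A_m
bins <- cut(x, breaks = breaks, include.lowest = TRUE)
mu_hat <- as.numeric(table(bins)) / length(x)         # estimates of P(X in A_i)
mu_hat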
Let’s view the partition in the previous example from a different angle: what if we represent these bin probabilities as a function on the domain \(\cX\) (eventually, a function in an RKHS)? More specifically, we can rewrite the previous example as
\[ \begin{aligned} \mu_{P_i}(\cdot) &= \int 1_{A_i}(\cdot)1_{A_i}(x) p(x) dx \\ &= \int_{x \in A_i} 1_{A_i}(\cdot) p(x) dx \\ &= \begin{cases} P(X \in A_i) & \text{if } \cdot \in A_i \\ 0 & \text{otherwise} \end{cases} \end{aligned} \]
for each \(i = 1, \ldots, m\). The function \(\mu_{P_i}(\cdot)\) is constant on \(A_i\), taking the value \(P(X \in A_i)\), and is zero elsewhere. We can of course consider a more flexible version of this by considering all the regions \(A_1, \ldots, A_m\) together, i.e., defining a feature map
\[ \Phi(\cdot) = \big(1_{A_1}(\cdot), \ldots, 1_{A_m}(\cdot) \big)^\T. \]
The representation of the distribution can then be defined as
\[ \begin{aligned} \mu_P(\cdot) &= \int \langle \Phi(\cdot) , \Phi(x) \rangle p(x) \, dx \\ &= \langle \Phi(\cdot), \int \Phi(x) p(x) \, dx \rangle \\ &= \begin{bmatrix} 1_{A_1}(\cdot) \\ \vdots \\ 1_{A_m}(\cdot) \end{bmatrix}^\top \begin{bmatrix} P(X \in A_1) \\ \vdots \\ P(X \in A_m) \end{bmatrix} \end{aligned} \]
What does this function effectively do? It takes the value \(P(X \in A_i)\) whenever the input lies in \(A_i\), so it is a piecewise constant function on \(\cX\) that summarizes the probability mass in each region. In other words, we have embedded the density into a space of piecewise constant functions induced by the partition of \(\cX\). In fact, this corresponds to a tree kernel, which we will talk about later.
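As a quick illustration, the sketch below (reusing the hypothetical sample x and bin edges breaks from above) evaluates this piecewise constant embedding at a few points via the indicator feature map.
# indicator feature vector (1_{A_1}(t), ..., 1_{A_m}(t)) for a single point t
Phi <- function(t, breaks) {
as.numeric(table(cut(t, breaks = breaks, include.lowest = TRUE)))
}
# mean feature vector over the sample: estimates of (P(A_1), ..., P(A_m))
phi_bar <- rowMeans(sapply(x, Phi, breaks = breaks))
# the embedding mu_P(t) = <Phi(t), phi_bar> is piecewise constant in t
t_grid <- c(-0.9, -0.3, 0.3, 0.9)
sapply(t_grid, function(t) sum(Phi(t, breaks) * phi_bar))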
In general, we know that each feature map \(\Phi\) induces a kernel \(k(x,y) = \langle \Phi(x), \Phi(y) \rangle\). So we can define the representation of a distribution \(P\) as
\[ \begin{aligned} \mu_P(\cdot) &= \int k(\cdot, x) p(x) \, dx \\ &= \int k(\cdot, x) \, dP(x) \\ &= \E_{X \sim P} [k(\cdot, X)] \end{aligned} \]
This is called the kernel mean embedding of the distribution \(P\) into the RKHS \(\cH\) associated with the p.d. kernel \(k\). There are some nice properties of this embedding. First, it is linear in the underlying measure: for any two distributions \(P\) and \(Q\) and any \(a,b \in \RR\) (so that \(aP + bQ\) is a well-defined signed measure), we have
\[ \begin{aligned} \mu_{aP + bQ}(\cdot) &= \int k(\cdot, x) \, d(aP + bQ)(x) \\ &= a \int k(\cdot, x) \, dP(x) + b \int k(\cdot, x) \, dQ(x) \\ &= a \mu_P(\cdot) + b \mu_Q(\cdot) \end{aligned} \]
Second, it also satisfies the reproducing property, i.e., for any \(f \in \cH\), we have
\[ \begin{aligned} \langle f, \mu_P \rangle_\cH &= \Big\langle f, \int k(\cdot, x) \, dP(x) \Big\rangle_\cH \\ &= \int \langle f, k(\cdot, x) \rangle_\cH \, dP(x) \\ &= \int f(x) \,dP(x) \\ &= \E_{X \sim P} [f(X)] \end{aligned} \]
Here the term reproducing refers to the fact that the inner product with the mean embedding \(\mu_P\) “reproduces” the expectation of the function \(f\) under the distribution \(P\).
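As a sanity check, here is a small sketch that verifies the empirical counterpart of this identity, \(\langle f, \hat{\mu}_P \rangle_\cH = \frac{1}{n}\sum_i f(x_i)\) with \(\hat{\mu}_P = \frac{1}{n}\sum_i k(\cdot, x_i)\) (discussed more below), for an arbitrary sample and an arbitrary \(f = \sum_j \alpha_j k(\cdot, z_j)\) in the RKHS of a Gaussian kernel. The sample, the points \(z_j\), and the coefficients \(\alpha_j\) are made up for illustration.
# numerical check of the reproducing property with an empirical embedding
set.seed(2)
kgauss <- function(a, b, sigma = 0.5) exp(-outer(a, b, "-")^2 / (2 * sigma^2))
xs <- runif(200, min = -1, max = 1)                 # sample from P
z <- c(-0.5, 0.2, 0.8); alpha <- c(1, -2, 0.5)      # f(.) = sum_j alpha_j k(., z_j)
lhs <- sum(alpha * colMeans(kgauss(xs, z)))         # <f, mu_hat_P> = sum_j alpha_j mu_hat_P(z_j)
rhs <- mean(kgauss(xs, z) %*% alpha)                # (1/n) sum_i f(x_i)
c(lhs = lhs, rhs = rhs)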
Let’s take the linear kernel \(k(x,y) = x^\top y\) as an example. The associated RKHS is the space of linear functions, i.e., \(\cH = \{ f(x) = w^\top x : w \in \RR^d \}\). The kernel mean embedding of a distribution \(P\) is
\[ \begin{aligned} \mu_P(\cdot) &= \int k(\cdot, x) \, dP(x) \\ &= \int \cdot^\top x \, dP(x) \\ &= \cdot^\top \int x \, dP(x) \\ &= \cdot^\top \E_{X \sim P}[X] \end{aligned} \]
So the kernel mean embedding is a linear function whose coefficient vector is the mean of the distribution \(P\). Evaluated at any input \(t\), it returns the expectation of the linear function \(t^\top X\) under \(P\).
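A one-line numerical check (with an arbitrary sample and an arbitrary evaluation point): for the linear kernel, the empirical embedding evaluated at a point is just that point dotted with the sample mean.
set.seed(3)
X <- matrix(rnorm(500 * 3), ncol = 3)   # n = 500 samples in R^3 (illustrative)
t0 <- c(1, -1, 2)                       # an arbitrary evaluation point
c(embedding = mean(X %*% t0), direct = sum(t0 * colMeans(X)))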
We know that a quadratic kernel can be written as \(k(x,y) = (x^\top y)^2\). The associated feature map is
\[ \Phi(x) = (x_i x_j)_{1 \leq i,j \leq d} \in \RR^{d^2} \]
Hence the kernel mean embedding of a distribution \(P\) is
\[ \begin{aligned} \mu_P(\cdot) &= \int k(\cdot, x) \, dP(x) \\ &= \int (\cdot^\top x)^2 \, dP(x) \\ &= \int (\cdot^\top x) (x^\top \cdot) \, dP(x) \\ &= \int \cdot^\top (x x^\top) \cdot \, dP(x) \\ &= \cdot^\top \Big( \int x x^\top \, dP(x) \Big) \cdot \\ &= \cdot^\top \E_{X \sim P}[X X^\top] \cdot \end{aligned} \]
The term in the middle is the \(d \times d\) second moment matrix of the distribution \(P\). Evaluated at any input \(t\), the mean embedding returns the expected value of the quadratic form \(t^\top X X^\top t = (t^\top X)^2\) when \(X \sim P\).
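Continuing the small sketch above (reusing X and t0), the empirical quadratic-kernel embedding at t0 matches the quadratic form in the empirical second moment matrix.
second_moment <- crossprod(X) / nrow(X)   # (1/n) sum_i x_i x_i^T
c(embedding = mean((X %*% t0)^2), direct = drop(t0 %*% second_moment %*% t0))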
For this example, let’s visualize the kernel mean embedding of a Uniform distribution \(P = \text{Unif}(-1,1)\) on \(\RR\) with a Gaussian kernel \(k(x,y) = \exp(-\|x-y\|^2 / (2\sigma^2))\). The kernel mean embedding is
\[ \begin{aligned} \mu_P(\cdot) &= \int k(\cdot, x) \, dP(x) \\ &= \int \exp\Big(-\frac{\|\cdot - x\|^2}{2\sigma^2}\Big) \, dP(x) \\ &= \int_{-1}^1 \frac{1}{2} \exp\Big(-\frac{\|\cdot - x\|^2}{2\sigma^2}\Big) \, dx \end{aligned} \]
Hence the embedding is the convolution of the Uniform density with the Gaussian kernel, which simplifies to a (scaled) difference of two Gaussian CDFs. Below, the embedding is plotted for several bandwidths (\(\sigma = 1, 0.4, 0.1\)), rescaled by \(1/(\sqrt{2\pi}\sigma)\) so that the curves are comparable, along with the original Uniform density in gray.
library(ggplot2)
set.seed(123)
sigma <- 0.3
x_seq <- seq(-3, 3, length.out = 200)
# KME of Unif(-1,1) with the Gaussian kernel, rescaled by 1/(sqrt(2*pi)*sigma):
# equivalently, the convolution of the Unif(-1,1) density with a N(0, sigma^2) density
mu_P <- function(t, sigma) {
(pnorm(t + 1, sd = sigma) - pnorm(t - 1, sd = sigma)) / 2
}
df <- data.frame(x = x_seq,
y1 = mu_P(x_seq, sigma = 1),   # sigma = 1
y2 = mu_P(x_seq, sigma = 0.4), # sigma = 0.4
y3 = mu_P(x_seq, sigma = 0.1)) # sigma = 0.1
p <- ggplot(df) +
geom_line(aes(x = x, y = y1), color = 'blue', linewidth = 1) +
geom_line(aes(x = x, y = y2), color = 'darkorange', linewidth = 1) +
geom_line(aes(x = x, y = y3), color = 'darkgreen', linewidth = 1) +
geom_area(data = subset(df, x >= -1 & x <= 1),
aes(x = x, y = 0.5), fill = "gray", alpha = 0.5) +
labs(x = 'x', y = expression(mu[P](x)),
title = 'KME of Uniform(-1,1) with Gaussian Kernel') +
theme_minimal() +
theme(text = element_text(size=16)) +
ylim(0, 0.6) +
annotate("text", x = 2, y = 0.55, label = "sigma == 1",
parse = TRUE, color = 'blue', size = 5) +
annotate("text", x = 2, y = 0.49, label = "sigma == 0.4",
parse = TRUE, color = 'darkorange', size = 5) +
annotate("text", x = 2, y = 0.43, label = "sigma == 0.1",
parse = TRUE, color = 'darkgreen', size = 5)
print(p)
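As a quick check of the closed-form expression used in mu_P() (a small sketch; the evaluation point and bandwidth below are arbitrary), we can compare it against direct numerical integration of the kernel against the Uniform density, applying the same \(1/(\sqrt{2\pi}\sigma)\) rescaling.
t_eval <- 0.5; s <- 0.4
num <- integrate(function(x) exp(-(t_eval - x)^2 / (2 * s^2)) / 2, lower = -1, upper = 1)$value
c(closed_form = mu_P(t_eval, sigma = s), numerical = num / (sqrt(2 * pi) * s))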
Let’s now look at the sample version. If we observe a set of observations \(\{x_i\}_{i=1}^n\) from the Uniform\([-1, 1]\) distribution, then the empirical kernel mean embedding is
\[ \hat{\mu}_P(\cdot) = \frac{1}{n} \sum_{i=1}^n k(\cdot, x_i). \]
Below is the plot of the empirical embedding with \(n=50\) samples, for the same three bandwidths.
set.seed(546)
n = 50
x_sample <- runif(n, min = -1, max = 1)
hat_mu <- function(t, x_sample, sigma) {
# empirical KME (1/n) sum_i k(t, x_i), rescaled by 1/(2 * sigma) for visualization
sapply(t, function(x) mean(exp(- (x - x_sample)^2 / (2 * sigma^2)) / 2 / sigma))
}
df2 <- data.frame(x = x_seq,
y1 = hat_mu(x_seq, x_sample, sigma = 1),   # sigma = 1
y2 = hat_mu(x_seq, x_sample, sigma = 0.4), # sigma = 0.4
y3 = hat_mu(x_seq, x_sample, sigma = 0.1)) # sigma = 0.1
p2 <- ggplot(df2) +
geom_line(aes(x = x, y = y1), color = 'blue', linewidth = 1) +
geom_line(aes(x = x, y = y2), color = 'darkorange', linewidth = 1) +
geom_line(aes(x = x, y = y3), color = 'darkgreen', linewidth = 1) +
geom_area(data = subset(df2, x >= -1 & x <= 1),
aes(x = x, y = 0.5), fill = "gray", alpha = 0.5) +
labs(x = 'x', y = expression(hat(mu)[P](x)),
title = 'Empirical KME of Uniform(-1,1)') +
theme_minimal() +
theme(text = element_text(size=16)) +
ylim(0, 0.9) +
annotate("text", x = 2, y = 0.55, label = "sigma == 1",
parse = TRUE, color = 'blue', size = 5) +
annotate("text", x = 2, y = 0.49, label = "sigma == 0.4",
parse = TRUE, color = 'darkorange', size = 5) +
annotate("text", x = 2, y = 0.43, label = "sigma == 0.1",
parse = TRUE, color = 'darkgreen', size = 5)
print(p2)
But isn’t this just a kernel density estimate (KDE)?
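With a Gaussian kernel, essentially yes: the unscaled empirical embedding \(\frac{1}{n}\sum_i k(\cdot, x_i)\) equals \(\sqrt{2\pi}\,\sigma\) times the Gaussian KDE with bandwidth \(\sigma\). The small check below (reusing x_sample from above; the grid and bandwidth are arbitrary) confirms this. The embedding view, however, is defined for any p.d. kernel, not only ones that correspond to a density estimator.
s <- 0.4
t_grid <- c(-1.5, -0.5, 0, 0.7)
kme <- sapply(t_grid, function(t) mean(exp(-(t - x_sample)^2 / (2 * s^2))))
kde <- sapply(t_grid, function(t) mean(dnorm(t, mean = x_sample, sd = s)))
all.equal(kme, sqrt(2 * pi) * s * kde)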
A natural question to ask is: can we recover the original distribution \(P\) from its kernel mean embedding \(\mu_P\)? In general, the answer is no. This is easy to see: if the kernel is only linear or quadratic, then the embedding captures only the first or second moments of the distribution, which is not enough to recover it. However, if the kernel \(k\) is characteristic, then the answer is yes. A p.d. kernel \(k\) is characteristic if the map \(P \mapsto \mu_P\) is injective, i.e., for any two distributions \(P\) and \(Q\),
\[ \mu_P = \mu_Q \Longleftrightarrow P = Q. \]
Many commonly used kernels are characteristic, including the Gaussian kernel and the Laplace kernel. (Actually reconstructing the distribution \(P\) from its embedding \(\mu_P\) is a much harder problem.) The following explains why the Gaussian kernel is characteristic. For translation-invariant kernels on \(\mathbb{R}^d\) of the form \[ k(x,y) = \psi(x-y), \] Bochner’s theorem1 states that \(k\) is positive definite if and only if \(\psi\) is the Fourier transform of a finite nonnegative measure \(\Lambda\), i.e., \[ k(x,y) = \int_{\mathbb{R}^d} e^{i\omega^\top(x-y)} \, d\Lambda(\omega). \]
Using this representation, the kernel mean embedding can be expressed in Fourier space as \[ \begin{aligned} \mu_P(x) &= \int_{\mathbb{R}^d}\!\!\left(\int_{\mathbb{R}^d} e^{i\omega^\top(x-z)}\,d\Lambda(\omega)\right)\!dP(z) \\ &= \int_{\mathbb{R}^d} e^{i\omega^\top x}\Big(\int e^{-i\omega^\top z}\,dP(z)\Big)\,d\Lambda(\omega) \\ &= \int_{\mathbb{R}^d} e^{i\omega^\top x}\,\varphi_P(-\omega)\,d\Lambda(\omega) \\ &= \int_{\mathbb{R}^d} e^{-i\omega^\top x}\,\varphi_P(\omega)\,d\Lambda(\omega), \end{aligned} \]
where \(\varphi_P(\omega) = \E_{X\sim P}[e^{i\omega^\top X}]\) is the characteristic function of \(P\), so that \(\varphi_P(-\omega)=\overline{\varphi_P(\omega)}\), and the last equality uses the fact that \(k\) is real-valued, which forces the spectral measure to be symmetric, \(d\Lambda(\omega)=d\Lambda(-\omega)\). If the spectral measure \(\Lambda\) has full support, as it does for the Gaussian kernel, then \(\mu_P\) determines the characteristic function \(\varphi_P\), and hence the distribution \(P\), uniquely.
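To see that the Gaussian kernel indeed has a fully supported spectral measure, recall that in one dimension \(k(x,y) = \exp\big(-(x-y)^2/(2\sigma^2)\big)\) can be written in the Bochner form above with
\[ d\Lambda(\omega) = \frac{\sigma}{\sqrt{2\pi}} \exp\Big(-\frac{\sigma^2 \omega^2}{2}\Big)\, d\omega, \]
a Gaussian density (up to scaling) that is strictly positive for every \(\omega\); the \(d\)-dimensional case is a product of such factors.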
A key application of KME is to compare two distributions. If we have distributions \(P\) and \(Q\), we can first calculate their kernel mean embeddings \(\mu_P\) and \(\mu_Q\). Then we compare these two embeddings in the RKHS in the sense that
\[ \mathrm{MMD}(P,Q;\cH) \;=\; \|\mu_P - \mu_Q\|_{\cH}. \]
This is called the maximum mean discrepancy (MMD) between \(P\) and \(Q\) with respect to the RKHS \(\cH\). It can be shown that if the kernel \(k\) is characteristic, then \(\mathrm{MMD}(P,Q;\cH)=0\) if and only if \(P=Q\). Hence MMD can be used as a metric on probability distributions. However, even if the kernel is not characteristic, MMD is still a useful discrepancy measure between two distributions. Formally, we can understand MMD using the following definition:
\[ \text{MMD}(P,Q;\cH) \;=\; \sup_{\|f\|_{\cH} \leq 1} \Big( \E_{X \sim P}[f(X)] - \E_{Y \sim Q}[f(Y)] \Big). \]
This means that the MMD is the largest possible difference between the expectation of \(f(X)\) when \(X \sim P\) and when \(X \sim Q\), over all test functions \(f\) in the unit ball of the RKHS. If the two distributions are very different, then there exists some function \(f\) that can amplify this difference. Note that we constrain the function \(f\) to have RKHS norm at most 1 to avoid the trivial solution of making \(f\) arbitrarily large. Let’s look at the difference between these two expectations, using the reproducing property of the RKHS:
\[ \begin{aligned} \E_{X \sim P}[f(X)] - \E_{Y \sim Q}[f(Y)] &= \langle f, \mu_P \rangle_{\cH} - \langle f, \mu_Q \rangle_{\cH} \\ &= \langle f, \mu_P - \mu_Q \rangle_{\cH} \\ &\leq \|f\|_{\cH} \|\mu_P - \mu_Q\|_{\cH} \end{aligned} \]
Hence the supremum is attained when \(f\) points in the direction of \(\mu_P - \mu_Q\) with unit norm (assuming \(\mu_P \neq \mu_Q\)), i.e.,
\[ \begin{aligned} &\sup_{\|f\|_{\cH} \leq 1} \Big( \E_{X \sim P}[f(X)] - \E_{Y \sim Q}[f(Y)] \Big) \\ &= \Big\langle \frac{\mu_P - \mu_Q}{\|\mu_P - \mu_Q\|_{\cH}}, \mu_P - \mu_Q \Big\rangle_{\cH} \\ &= \|\mu_P - \mu_Q\|_{\cH}. \end{aligned} \]
It is not immediately obvious how this quantity can be computed in practice. However, by expanding the squared RKHS norm and using the reproducing property, we can see that
\[ \begin{aligned} \text{MMD}^2 &= \langle \mu_P - \mu_Q, \mu_P - \mu_Q \rangle_{\cH} \\ &=\langle \mu_P, \mu_P \rangle_{\cH} + \langle \mu_Q, \mu_Q \rangle_{\cH} - 2 \langle \mu_P, \mu_Q \rangle_{\cH}\\ &= \langle \int k(\cdot, x) \, dP(x), \int k(\cdot, x') \, dP(x') \rangle_{\cH} + \langle \int k(\cdot, y) \, dQ(y), \int k(\cdot, y') \, dQ(y') \rangle_{\cH} \\ &\quad - 2 \langle \int k(\cdot, x) \, dP(x), \int k(\cdot, y) \, dQ(y) \rangle_{\cH} \\ &= \iint k(x, x') \, dP(x) dP(x') + \iint k(y, y') \, dQ(y) dQ(y') \\ &\quad - 2 \iint k(x, y) \, dP(x) dQ(y) \\ &= \E_{X,X' \sim P}[k(X,X')] + \E_{Y,Y' \sim Q}[k(Y,Y')] - 2 \, \E_{X \sim P,\,Y \sim Q}[k(X,Y)] \end{aligned} \]
Given i.i.d. samples \(\{x_i\}_{i=1}^n \sim P\) and \(\{y_j\}_{j=1}^m \sim Q\), the estimator is
\[ \widehat{\mathrm{MMD}}^2 \;=\; \frac{1}{n(n-1)}\sum_{i \neq i'} k(x_i,x_{i'}) + \frac{1}{m(m-1)}\sum_{j \neq j'} k(y_j,y_{j'}) - \frac{2}{mn}\sum_{i,j} k(x_i,y_j). \]
Let’s look at an empirical example of using MMD to compare two distributions. We will use the same Uniform\([-1,1]\) distribution as \(P\), and a Gaussian distribution \(Q = \cN(0, 1)\) as an alternative. We will use the Gaussian kernel with bandwidth \(\sigma = 0.3\). Below we plot histograms of samples from the two distributions.
set.seed(123)
# Gaussian kernel
kfun <- function(x, y, sigma=1) {
exp(-(outer(x, y, "-")^2) / (2 * sigma^2))
}
mmd_unbiased <- function(x, y, sigma=1) {
m <- length(x); n <- length(y)
Kxx <- kfun(x, x, sigma)
Kyy <- kfun(y, y, sigma)
Kxy <- kfun(x, y, sigma)
term_xx <- (sum(Kxx) - sum(diag(Kxx))) / (m * (m - 1))
term_yy <- (sum(Kyy) - sum(diag(Kyy))) / (n * (n - 1))
term_xy <- mean(Kxy)
term_xx + term_yy - 2 * term_xy
}
# Example: Gaussian vs Uniform
m <- n <- 200
x <- rnorm(m, mean = 0, sd = 1) # Gaussian N(0,1)
y <- runif(n, min = -1, max = 1) # Uniform(-1,1)
# plot the two sets of samples
df_samples <- data.frame(value = c(x, y),
group = rep(c('Gaussian', 'Uniform'), each = m))
p_samples <- ggplot(df_samples, aes(x = value, fill = group)) +
geom_histogram(aes(y = after_stat(density)), position = 'identity', alpha = 0.5, bins = 30) +
labs(x = 'Value', y = 'Density', title = 'Samples from Two Distributions') +
theme_minimal() +
theme(text = element_text(size=16)) +
scale_fill_manual(values = c('Gaussian' = 'darkorange', 'Uniform' = 'lightblue'))
print(p_samples)
mmd_val <- mmd_unbiased(x, y, sigma = 0.3)
print(mmd_val)
## [1] 0.02625334
But how do we know if this value is large or small? We could consider a permutation test to get the null distribution of the MMD statistic under the null hypothesis that \(P=Q\). Below is the code to perform the permutation test and the resulting plot of the null distribution along with the observed MMD value.
set.seed(1234)
nperm <- 500
mmd_perm <- numeric(nperm)
xy <- c(x, y)
for (b in 1:nperm) {
perm_idx <- sample(1:(m+n), m+n, replace = FALSE)
x_perm <- xy[perm_idx[1:m]]
y_perm <- xy[perm_idx[(m+1):(m+n)]]
mmd_perm[b] <- mmd_unbiased(x_perm, y_perm, sigma = 0.3)
}
p_value <- mean(mmd_perm >= mmd_val)
print(p_value)
## [1] 0.002
df_mmd <- data.frame(mmd = mmd_perm)
p_mmd <- ggplot(df_mmd, aes(x = mmd)) +
geom_histogram(aes(y = after_stat(density)), bins = 30, fill = 'lightblue', color = 'black') +
geom_vline(xintercept = mmd_val, color = 'red', linewidth = 1) +
labs(x = expression(hat(MMD)^2),
y = 'Density',
title = paste('Permutation Test for MMD (p-value =', round(p_value, 4), ')')) +
theme_minimal() +
theme(text = element_text(size=16)) +
annotate("text", x = mmd_val - 0.005, y = max(density(mmd_perm)$y)*0.8,
label = "Observed MMD", color = 'red', size = 5)
print(p_mmd)
MMD has many interesting applications, including two-sample testing and conditional independence testing. It is also used in training some generative models, such as the MMD-GAN. Interested readers can refer to (Li et al. 2017) and (Muandet et al. 2017) for more details.
References:
This is a different way from Mercer’s theorem to represent a positive definite kernel, but it applies only to translation-invariant kernels. Mercer’s theorem gives an orthogonal basis expansion of the kernel, while Bochner’s theorem gives a Fourier representation of the kernel.↩︎