Abstract
This is the supplementary R file for the Gaussian Mixture Model in the lecture note “EM”.
Suppose that we have a mixture of two populations, generated from \({\cal N}(-2, 1)\) and \({\cal N}(2, 1)\), respectively. We observe \(n = 1000\) observations, with the second population (\({\cal N}(2, 1)\)) making up about 25% of the sample. However, when we actually observe the data, we do NOT know the underlying data generator. The following data are observed, with a density plot:
library(lattice)
set.seed(1)
n = 1000
x1 = rnorm(n, mean=-2)
x2 = rnorm(n, mean=2)
z = (runif(n) <= 0.25) # an indicator for group membership
x = ifelse(z, x2, x1) # the observed data
par(mar=rep(2,4))
densityplot(x, cex = 0.5, pch = 19,
            par.settings = list(plot.symbol = list(col = ifelse(z, "darkorange", "deepskyblue"))))
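Since we simulated the data ourselves, we can, for reference only, overlay the true mixture density on the estimated density of the observations. The plot below is our own illustrative addition and uses the true parameters, which of course would not be available in practice:
# reference only: compare the estimated density of x with the true mixture
# density 0.75 * N(-2, 1) + 0.25 * N(2, 1) used to generate the data
xs = seq(-6, 6, by = 0.01)
plot(density(x), main = "Observed data vs. true mixture density")
lines(xs, 0.75 * dnorm(xs, mean = -2) + 0.25 * dnorm(xs, mean = 2),
      col = "darkorange", lwd = 2)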
Of course we wouldn’t know the color (group membership) of the observations. Based on what was described above, our observed data are generated from a mixture of Gaussian distributions, with the density function (assuming that the variance of each group is known to be 1)
\[p(x) = (1 - \pi) \, \phi_{\mu_1}(x) + \pi \, \phi_{\mu_2}(x),\] where \(\phi_{\mu}(x)\) is the density function of \({\cal N}(\mu, 1)\). We might choose to optimize this objective function (the observed-data likelihood) directly, which could be complicated in certain situations; a numerical sketch of this direct approach is shown below. Hence, here we will consider a different approach. Let’s assume that there is a latent group indicator \(Z\) for each observation. We assume that \(Z\) follows a Bernoulli distribution with probability \(\pi\). Then the log-likelihood of the complete data \(\{x_i, z_i\}_{i = 1}^n\) becomes
\[\begin{align} \ell(\mathbf{x}, \mathbf{z} | \boldsymbol \theta) =&~ \sum_{i=1}^n \big[ (1 - z_i) \log \phi_{\mu_1}(x_i) + z_i \log \phi_{\mu_2}(x_i) \big] + \sum_{i=1}^n \big[ (1 - z_i) \log (1-\pi) + z_i \log \pi \big], \end{align}\]
where \(\mathbf{x}\) is the vector of all observed \(X\), \(\mathbf{z}\) is the vector of all latent indicators \(Z\), and \(\boldsymbol \theta\) represents all unknown parameters, in this case just \(\mu_1\), \(\mu_2\), and \(\pi\). The first part comes from the conditional likelihood \(p(\mathbf{x} | \mathbf{z}, \boldsymbol \theta)\), while the second part is \(p(\mathbf{z} | \boldsymbol \theta)\).
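Before turning to the EM algorithm, here is a minimal sketch of the direct approach mentioned above: numerically maximizing the observed-data log-likelihood \(\sum_{i=1}^n \log p(x_i)\) with optim(). The function neg_loglik and the logit parameterization of \(\pi\) are our own illustrative choices, not part of the lecture note:
# sketch: directly maximize the observed-data log-likelihood with optim()
# theta = (mu1, mu2, logit of pi); plogis() keeps pi inside (0, 1)
neg_loglik = function(theta) {
  p = plogis(theta[3])
  -sum(log((1 - p) * dnorm(x, mean = theta[1]) + p * dnorm(x, mean = theta[2])))
}
direct_fit = optim(par = c(-0.25, 0.25, 0), fn = neg_loglik)
# report the estimates on the original scale
c(mu1 = direct_fit$par[1], mu2 = direct_fit$par[2], PI = plogis(direct_fit$par[3]))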
Based on this new setting, the EM algorithm iterates between two steps: the E-step, which computes the conditional expectation of the latent indicators given the current parameter estimates \(\boldsymbol \theta^{(t)}\), and the M-step, which maximizes the resulting expected complete-data log-likelihood to update the parameters. Since \(\ell(\mathbf{x}, \mathbf{z} | \boldsymbol \theta)\) is linear in the \(z_i\), the E-step simply replaces each \(z_i\) with its conditional expectation. The two steps in our example are given by
\[\widehat{z}_i = E\big[Z_i \mid x_i, \boldsymbol \theta^{(t)}\big] = \frac{\pi^{(t)} \, \phi_{\mu_2^{(t)}}(x_i)}{(1 - \pi^{(t)}) \, \phi_{\mu_1^{(t)}}(x_i) + \pi^{(t)} \, \phi_{\mu_2^{(t)}}(x_i)} \quad \text{(E-step)}\]
and
\[\pi^{(t+1)} = \frac{1}{n} \sum_{i=1}^n \widehat{z}_i, \qquad \mu_1^{(t+1)} = \frac{\sum_{i=1}^n (1 - \widehat{z}_i) \, x_i}{\sum_{i=1}^n (1 - \widehat{z}_i)}, \qquad \mu_2^{(t+1)} = \frac{\sum_{i=1}^n \widehat{z}_i \, x_i}{\sum_{i=1}^n \widehat{z}_i} \quad \text{(M-step)}.\]
This can be implemented very easily with the following code:
# let's set up some initial values
hat_PI = 0.5
hat_mu1 = -0.25
hat_mu2 = 0.25

# perform EM
for (i in 1:10) {
  # E-step:
  # calculate the conditional distribution of the hidden variable z
  d1 = (1 - hat_PI) * dnorm(x, mean = hat_mu1)
  d2 = hat_PI * dnorm(x, mean = hat_mu2)
  ez = d2 / (d1 + d2)

  # M-step:
  # based on the conditional distribution, calculate the new MLE of the parameters
  hat_PI = mean(ez)
  hat_mu1 = sum((1 - ez) * x) / sum(1 - ez)
  hat_mu2 = sum(ez * x) / sum(ez)

  # print the current estimates
  print(c(hat_mu1, hat_mu2, hat_PI))
}
## [1] -1.7424035 0.1277127 0.3877030
## [1] -2.1850469 1.1835122 0.3466446
## [1] -2.1304023 1.6958100 0.2909009
## [1] -2.0607891 1.9573795 0.2596793
## [1] -2.0244826 2.0758484 0.2456213
## [1] -2.0083050 2.1249130 0.2397529
## [1] -2.0015859 2.1446105 0.2373819
## [1] -1.9988787 2.1524340 0.2364372
## [1] -1.997801 2.155529 0.236063
## [1] -1.997375 2.156752 0.235915
Hence, we eventually settle on the estimates \(\widehat \mu_1 = -1.997\), \(\widehat \mu_2 = 2.157\), and \(\widehat \pi = 0.236\), which are close to the true values of \(-2\), \(2\), and \(0.25\). The algorithm converges very quickly.
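As a quick sanity check (our own addition, not part of the lecture note), we can use the posterior probabilities ez from the final E-step to classify each observation and compare against the true group indicator z, which is available here only because we simulated the data:
# classify each observation by its posterior probability of coming from the
# second component (mean 2), then compare with the true indicator z
table(estimated = ez > 0.5, truth = z)
# overall agreement rate with the truth
mean((ez > 0.5) == z)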