Basic Concepts

Suppose that we have a mixture of two populations, generated from \({\cal N}(-2, 1)\) and \({\cal N}(2, 1)\), respectively. We observe \(n = 1000\) observations, with the second population accounting for about 25% of the sample. However, when we actually observe the data, we do NOT know the underlying data generator. The following data are observed, displayed with a density plot:

  library(lattice)

  set.seed(1)
  n = 1000
  x1 = rnorm(n, mean = -2)
  x2 = rnorm(n, mean = 2)
  z = (runif(n) <= 0.25) # indicator for group membership (TRUE = second group)
  x = ifelse(z, x2, x1)  # the observed data
  
  # plot the observed data, colored by the (unobserved) group membership
  densityplot(~ x, cex = 0.5, pch = 19, 
              par.settings = list(plot.symbol = list(col = ifelse(z, "darkorange", "deepskyblue"))))

Of course we wouldn’t know the color (group membership) of the observations. Based on what was described above, our observed data are generated from a mixture of two Gaussian distributions, with density function (assuming that the variance of each group is known to be 1)

\[p(x) = (1 - \pi) \, \phi_{\mu_1}(x) + \pi \, \phi_{\mu_2}(x),\] where \(\phi_{\mu}(x)\) is the density function of \({\cal N}(\mu, 1)\). We could try to maximize the log-likelihood based on this mixture density directly, but that optimization can be complicated in certain situations. Hence, here we will consider a different approach. Let’s assume that there is a latent group indicator \(Z\) for each observation, and that \(Z\) follows a Bernoulli distribution with success probability \(\pi\). Then the complete-data log-likelihood of \(\{x_i, z_i\}_{i = 1}^n\) becomes

\[\begin{align} \ell(\mathbf{x}, \mathbf{z} | \boldsymbol \theta) =&~ \sum_{i=1}^n \big[ (1 - z_i) \log \phi_{\mu_1}(x_i) + z_i \log \phi_{\mu_2}(x_i) \big] + \sum_{i=1}^n \big[ (1 - z_i) \log (1-\pi) + z_i \log \pi \big], \end{align}\]

where \(\mathbf{x}\) is the vector of all observed \(X\), \(\mathbf{z}\) is the vector of all latent \(Z\), and \(\boldsymbol \theta\) represents all unknown parameters, in this case just \(\mu_1\), \(\mu_2\), and \(\pi\). The first sum comes from the conditional likelihood \(p(\mathbf{x} | \mathbf{z}, \boldsymbol \theta)\), while the second comes from \(p(\mathbf{z} | \boldsymbol \theta)\).
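
To see the formula in action, here is a minimal sketch (the helper name complete_loglik is only for illustration, not from the text) that evaluates this complete-data log-likelihood on the simulated data, using the true labels z and the true parameters.

  # a sketch: the complete-data log-likelihood from the formula above
  # (the name complete_loglik is only for illustration)
  complete_loglik = function(x, z, mu1, mu2, prob) {
    sum( (1 - z) * dnorm(x, mean = mu1, log = TRUE) + z * dnorm(x, mean = mu2, log = TRUE) ) +
      sum( (1 - z) * log(1 - prob) + z * log(prob) )
  }
  
  # evaluate at the true parameters used to generate the data
  complete_loglik(x, z, mu1 = -2, mu2 = 2, prob = 0.25)

Of course, in practice \(\mathbf{z}\) is not observed, so this quantity cannot be evaluated directly; this is exactly the situation the EM algorithm is designed to handle.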

Based on this new setting, the EM algorithm iterates between two steps: in the E-step, given the current parameter estimates, we take the conditional expectation of the complete-data log-likelihood with respect to \(\mathbf{z} \mid \mathbf{x}\); in the M-step, we maximize this expected log-likelihood over \(\boldsymbol \theta\) to update the parameter estimates. The two steps are repeated until convergence.

Since the complete-data log-likelihood above is linear in each \(z_i\), the E-step only requires the conditional expectation \(\widehat z_i = E[Z_i \mid x_i, \boldsymbol \theta]\), which is the posterior probability that observation \(i\) comes from the second component. The two steps in our example are given by

\[\begin{align} \widehat z_i =&~ \frac{\widehat\pi \, \phi_{\widehat\mu_2}(x_i)}{(1 - \widehat\pi) \, \phi_{\widehat\mu_1}(x_i) + \widehat\pi \, \phi_{\widehat\mu_2}(x_i)} \quad \text{(E-step)}, \\ \widehat\pi =&~ \frac{1}{n} \sum_{i=1}^n \widehat z_i, \qquad \widehat\mu_1 = \frac{\sum_{i=1}^n (1 - \widehat z_i) \, x_i}{\sum_{i=1}^n (1 - \widehat z_i)}, \qquad \widehat\mu_2 = \frac{\sum_{i=1}^n \widehat z_i \, x_i}{\sum_{i=1}^n \widehat z_i} \quad \text{(M-step)}. \end{align}\]

This can be implemented very easily with the following code:

  # let's set up some initial values
  hat_PI = 0.5
  hat_mu1 = -0.25
  hat_mu2 = 0.25
  
  # perform EM for 10 iterations
  for (i in 1:10) {
    # E-step:
    # the conditional probability that z_i = 1, given x_i and the current estimates
    d1 = (1 - hat_PI) * dnorm(x, mean = hat_mu1)
    d2 = hat_PI * dnorm(x, mean = hat_mu2)
    ez = d2 / (d1 + d2)
    
    # M-step:
    # update the parameters by maximizing the expected complete-data log-likelihood
    hat_PI = mean(ez)
    hat_mu1 = sum( (1 - ez) * x ) / sum(1 - ez)
    hat_mu2 = sum( ez * x ) / sum(ez)
  
    # print the current estimates
    print( c(hat_mu1, hat_mu2, hat_PI) )
  }
## [1] -1.7424035  0.1277127  0.3877030
## [1] -2.1850469  1.1835122  0.3466446
## [1] -2.1304023  1.6958100  0.2909009
## [1] -2.0607891  1.9573795  0.2596793
## [1] -2.0244826  2.0758484  0.2456213
## [1] -2.0083050  2.1249130  0.2397529
## [1] -2.0015859  2.1446105  0.2373819
## [1] -1.9988787  2.1524340  0.2364372
## [1] -1.997801  2.155529  0.236063
## [1] -1.997375  2.156752  0.235915

Hence, we eventually settle on the estimates \(\widehat \mu_1 = -1.997\), \(\widehat \mu_2 = 2.157\), and \(\widehat \pi = 0.236\), which are close to the true values. The algorithm also converges very quickly.
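
As an optional check (a sketch, not part of the original analysis), we can look at the observed-data log-likelihood at the final estimates, which the EM algorithm guarantees never decreases across iterations, and, since this is a simulation, compare the posterior probabilities from the last E-step with the true group labels. The code below assumes the objects hat_PI, hat_mu1, hat_mu2, and ez from the loop above are still in the workspace.

  # observed-data (marginal) log-likelihood at the final estimates;
  # EM guarantees this quantity is non-decreasing over iterations
  sum( log( (1 - hat_PI) * dnorm(x, mean = hat_mu1) + hat_PI * dnorm(x, mean = hat_mu2) ) )
  
  # since this is a simulation, compare the estimated posterior probabilities
  # (from the last E-step) with the true, normally unobserved, group labels
  table(true = z, classified = (ez > 0.5))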