Chapter 13 Discriminant Analysis

When we model the probability of \(Y\) given \(X\), such as in logistic regression, the approach is often called soft classification, meaning that we do not directly produce a class label as the prediction. However, we can also view the task as finding a function with 0/1 as the output. In this case, the function is called a classifier:

\[f : \mathbb{R}^p \longrightarrow \{0, 1\}\] With a classifier, we can directly evaluate the prediction error using the 0-1 loss:

\[L\big(f(\mathbf{x}), y \big) = \begin{cases} 0 \quad \text{if} \quad y = f(\mathbf{x})\\ 1 \quad \text{o.w.} \end{cases}\]

The goal is to minimize the overall risk, the integrated loss:

\[\text{R}(f) = \text{E}_{X, Y} \left[ L\big(f(X), Y\big) \right]\] Continuing the notation from logistic regression, with \(\eta(\mathbf{x}) = \text{P}(Y = 1 | X = \mathbf{x})\), we can easily see that the risk-minimizing decision rule is to take the dominant label at any given \(\mathbf{x}\). This leads to the Bayes rule:

\[\begin{align} f_B(\mathbf{x}) = \underset{f}{\arg\min} \,\, \text{R}(f) = \begin{cases} 1 & \text{if} \quad \eta(\mathbf{x}) \geq 1/2 \\ 0 & \text{if} \quad \eta(\mathbf{x}) < 1/2. \\ \end{cases} \end{align}\]

Note that it does not matter which label we choose when \(\eta(\mathbf{x}) = 1/2\), since we will make a mistake 50% of the time either way. The risk associated with this rule is called the Bayes risk, which is the smallest risk achievable by any classifier under the 0-1 loss.
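To make the 0-1 loss and the Bayes rule concrete, here is a minimal simulation sketch (the particular \(\eta(x)\) below is an assumption made for illustration, not part of the text): we draw \(Y\) from a known \(\eta(x)\), classify with the \(1/2\) cut-off, and compare its empirical 0-1 risk with a sub-optimal cut-off.

  # a minimal sketch: the Bayes rule under a known (assumed) eta(x)
  set.seed(1)
  n = 10000
  x = runif(n, -3, 3)
  eta = 1 / (1 + exp(-2 * x))    # assumed P(Y = 1 | X = x) for illustration
  y = rbinom(n, 1, eta)

  # Bayes rule: predict 1 whenever eta(x) >= 1/2
  pred_bayes = as.numeric(eta >= 0.5)
  # a sub-optimal rule using a different cut-off
  pred_alt = as.numeric(eta >= 0.8)

  # empirical 0-1 risks; the Bayes rule should have the smaller one
  mean(y != pred_bayes)
  mean(y != pred_alt)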

13.1 Bayes Rule

The essential idea of discriminant analysis is to estimate the density function of each class, and compare the densities at any given target point to perform classification. Let's construct the Bayes rule from the Bayesian perspective:

\[\begin{align} \text{P}(Y = 1 | X = \mathbf{x}) &= \frac{\text{P}(X = \mathbf{x}| Y = 1)\text{P}(Y = 1)}{\text{P}(X = \mathbf{x})} \\ \text{P}(Y = 0 | X = \mathbf{x}) &= \frac{\text{P}(X = \mathbf{x}| Y = 0)\text{P}(Y = 0)}{\text{P}(X = \mathbf{x})} \end{align}\]

Let's further define the marginal (prior) probabilities \(\pi_1 = P(Y = 1)\) and \(\pi_0 = 1 - \pi_1 = P(Y = 0)\). Then, denote the conditional densities of \(X\) as

\[\begin{align} f_1 = \text{P}(X = \mathbf{x}| Y = 1)\\ f_0 = \text{P}(X = \mathbf{x}| Y = 0)\\ \end{align}\]

Note that the Bayes rule makes the decision 1 when \(\eta(\mathbf{x}) \geq 1/2\). Since the denominator \(\text{P}(X = \mathbf{x})\) is the same for both classes, this is equivalent to \(\pi_1 f_1(\mathbf{x}) \geq \pi_0 f_0(\mathbf{x})\). Hence, utilizing Bayes' theorem, we have

\[\begin{align} f_B(\mathbf{x}) = \underset{f}{\arg\min} \,\, \text{R}(f) = \begin{cases} 1 & \text{if} \quad \pi_1 f_1(\mathbf{x}) \geq \pi_0 f_0(\mathbf{x}) \\ 0 & \text{if} \quad \pi_1 f_1(\mathbf{x}) < \pi_0 f_0(\mathbf{x}). \\ \end{cases} \end{align}\]

This suggests that we can model the conditional density of \(X\) given \(Y\) instead of modeling \(P(Y | X)\) to make the decision.
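As a toy illustration of this comparison (the univariate densities and priors below are made-up values for the sketch, not from the text), we can classify a single point by weighing the two class densities by their priors:

  # assumed class-conditional densities: N(1, 1) for class 1, N(-1, 1) for class 0
  pi1 = 0.3
  pi0 = 0.7
  x0 = 0.2

  # Bayes rule: decide class 1 if pi1 * f1(x0) >= pi0 * f0(x0)
  as.numeric(pi1 * dnorm(x0, mean = 1) >= pi0 * dnorm(x0, mean = -1))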

13.2 Example: Linear Discriminant Analysis (LDA)

We create two density functions that use the same covariance matrix: \(X_1 \sim \cal{N}(\mu_1, \Sigma)\) and \(X_2 \sim \cal{N}(\mu_2, \Sigma)\), with \(\mu_1 = (0.5, -1)^\text{T}\), \(\mu_2 = (-0.5, 0.5)^\text{T}\), and \(\Sigma_{2\times2}\) has diagonal elements 1 and off diagonal elements 0.5. Let’s first generate some observations.

  library(mvtnorm)
  set.seed(1)
  
  # generate two sets of samples
  Sigma = matrix(c(1, 0.5, 0.5, 1), 2, 2)
  mu1 = c(0.5, -1)
  mu2 = c(-0.5, 0.5)
  
  # define prior
  p1 = 0.4 
  p2 = 0.6
    
  n = 1000
  
  Class1 = rmvnorm(n*p1, mean = mu1, sigma = Sigma)
  Class2 = rmvnorm(n*p2, mean = mu2, sigma = Sigma)

  plot(Class1, pch = 19, col = "darkorange", xlim = c(-4, 4), ylim = c(-4, 4))
  points(Class2, pch = 19, col = "deepskyblue")

If we know their true density functions, then the decision line is linear.
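Under the assumed Gaussian model with a common covariance matrix, this linear boundary can be written down in closed form. The sketch below (not part of the original code) computes it from the known parameters and overlays it on the scatter plot; it reuses `Sigma`, `mu1`, `mu2`, `p1`, and `p2` defined above.

  # the true decision boundary x^T w + b = 0, where
  #   w = Sigma^{-1} (mu1 - mu2)
  #   b = -0.5 * (mu1 + mu2)^T Sigma^{-1} (mu1 - mu2) + log(p1 / p2)
  w = solve(Sigma, mu1 - mu2)
  b = -0.5 * sum((mu1 + mu2) * w) + log(p1 / p2)

  # add the boundary to the previous plot (class 1 predicted when x^T w + b >= 0)
  abline(a = -b / w[2], b = -w[1] / w[2], lwd = 2)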

13.3 Linear Discriminant Analysis

As we demonstrated earlier, the conditional probability can be formulated using Bayes' theorem. This time, we will assume in general that there are \(K\) classes instead of just two. The notation is similar to the previous case:

\[\begin{align} \text{P}(Y = k | X = \mathbf{x}) =&~ \frac{\text{P}(X = \mathbf{x}| Y = k)\text{P}(Y = k)}{\text{P}(X = \mathbf{x})}\\ =&~ \frac{\text{P}(X = \mathbf{x}| Y = k)\text{P}(Y = k)}{\sum_{l=1}^K \text{P}(X = \mathbf{x}| Y = l) \text{P}(Y = l)}\\ =&~ \frac{\pi_k f_k(\mathbf{x})}{\sum_{l=1}^K \pi_l f_l(\mathbf{x})} \end{align}\]

Given any target point \(\mathbf{x}\), the best prediction is simply the class that maximizes the posterior

\[\underset{k}{\arg\max} \,\, \pi_k f_k(\mathbf{x})\] Both LDA and QDA model \(f_k\) as a normal density function. Suppose we model each class density as multivariate Gaussian \({\cal N}(\boldsymbol{\mu}_k, \Sigma_k)\), and, for LDA, further assume a common covariance matrix \(\Sigma_k = \Sigma\) for all \(k\). Then

\[f_k(x) = \frac{1}{(2\pi)^{p/2} |\Sigma|^{1/2}} \exp\left[ -\frac{1}{2} (\mathbf{x}- \boldsymbol{\mu}_k)^\text{T}\Sigma^{-1} (\mathbf{x}- \boldsymbol{\mu}_k) \right].\] The log-likelihood function for the conditional distribution is

\[\begin{align} \log f_k(\mathbf{x}) =&~ -\log \big((2\pi)^{p/2} |\Sigma|^{1/2} \big) - \frac{1}{2} (\mathbf{x}- \boldsymbol{\mu}_k)^\text{T}\Sigma^{-1} (\mathbf{x}- \boldsymbol{\mu}_k) \\ =&~ - \frac{1}{2} (\mathbf{x}- \boldsymbol{\mu}_k)^\text{T}\Sigma^{-1} (\mathbf{x}- \boldsymbol{\mu}_k) + \text{Constant} \end{align}\]

The maximum a posteriori probability (MAP) estimate is simply

\[\begin{align} \widehat f(\mathbf{x}) =& ~\underset{k}{\arg\max} \,\, \log \big( \pi_k f_k(\mathbf{x}) \big) \\ =& ~\underset{k}{\arg\max} \,\, - \frac{1}{2} (\mathbf{x}- \boldsymbol{\mu}_k)^\text{T}\Sigma^{-1} (\mathbf{x}- \boldsymbol{\mu}_k) + \log(\pi_k) \end{align}\]

The term \((\mathbf{x}- \boldsymbol{\mu}_k)^\text{T}\Sigma^{-1} (\mathbf{x}- \boldsymbol{\mu}_k)\) is simply the squared Mahalanobis distance between \(\mathbf{x}\) and the centroid \(\boldsymbol{\mu}_k\) of class \(k\). Hence, this essentially classifies \(\mathbf{x}\) to the class with the closest centroid (after adjusting for the prior). This draws a connection with the \(k\)NN algorithm. A special case is when \(\Sigma = \mathbf{I}\), in which case only the Euclidean distance is needed, and we have

\[\underset{k}{\arg\max} \,\, - \frac{1}{2} \lVert x - \boldsymbol{\mu}_k \rVert^2 + \log(\pi_k)\]
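As a quick check of this nearest-centroid view, here is a sketch under the \(\Sigma = \mathbf{I}\) assumption (not part of the original code), reusing `mu1`, `mu2`, `p1`, and `p2` from the simulation above:

  # nearest-centroid rule when Sigma = I:
  # pick the class maximizing -0.5 * ||x - mu_k||^2 + log(pi_k)
  score = function(x, mu, prior) -0.5 * sum((x - mu)^2) + log(prior)

  x0 = c(0, 0)   # an arbitrary target point
  ifelse(score(x0, mu1, p1) >= score(x0, mu2, p2), 1, 2)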

The decision boundary of LDA, as its name suggests, is a linear function of \(\mathbf{x}\). To see this, let's look at the terms in the MAP. Note that anything that does not depend on the class index \(k\) is irrelevant to the decision.

\[\begin{align} & - \frac{1}{2} (\mathbf{x}- \boldsymbol{\mu}_k)^\text{T}\Sigma^{-1} (\mathbf{x}- \boldsymbol{\mu}_k) + \log(\pi_k)\\ =&~ \mathbf{x}^\text{T}\Sigma^{-1} \boldsymbol{\mu}_k - \frac{1}{2}\boldsymbol{\mu}_k^\text{T}\Sigma^{-1} \boldsymbol{\mu}_k + \log(\pi_k) + \text{irrelevant terms} \\ =&~ \mathbf{x}^\text{T}\mathbf{w}_k + b_k + \text{irrelevant terms} \end{align}\]

Then, the decision boundary between two classes, \(k\) and \(l\) is

\[\begin{align} \mathbf{x}^\text{T}\mathbf{w}_k + b_k &= \mathbf{x}^\text{T}\mathbf{w}_l + b_l \\ \Leftrightarrow \quad \mathbf{x}^\text{T}(\mathbf{w}_k - \mathbf{w}_l) + (b_k - b_l) &= 0, \\ \end{align}\]

which is a linear function of \(\mathbf{x}\). The previous density plot already showed this effect. Estimating the parameters in LDA is very simple:

  • Prior probabilities: \(\widehat{\pi}_k = n_k / n = n^{-1} \sum_{i=1}^n \mathbf{1}\{y_i = k\}\), where \(n_k\) is the number of observations in class \(k\).
  • Centroids: \(\widehat{\boldsymbol{\mu}}_k = n_k^{-1} \sum_{i: \,y_i = k} x_i\)
  • Pooled covariance matrix: \[\widehat \Sigma = \frac{1}{n-K} \sum_{k=1}^K \sum_{i : \, y_i = k} (\mathbf{x}_i - \widehat{\boldsymbol{\mu}}_k) (\mathbf{x}_i - \widehat{\boldsymbol{\mu}}_k)^\text{T}\]
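The following sketch (not part of the original code) carries out these three estimation steps by hand on the simulated two-class data from Section 13.2, reusing `Class1` and `Class2`; in practice one would simply call `lda()` from the `MASS` package.

  # stack the simulated data and labels
  x = rbind(Class1, Class2)
  y = c(rep(1, nrow(Class1)), rep(2, nrow(Class2)))
  n = nrow(x)
  K = 2

  # prior probabilities and centroids
  pi_hat = table(y) / n
  mu_hat = rbind(colMeans(Class1), colMeans(Class2))

  # pooled covariance matrix
  S1 = crossprod(sweep(Class1, 2, mu_hat[1, ]))
  S2 = crossprod(sweep(Class2, 2, mu_hat[2, ]))
  Sigma_hat = (S1 + S2) / (n - K)

  # compare with the true Sigma used to generate the data
  Sigma_hat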

13.4 Example: Quadratic Discriminant Analysis (QDA)

When we assume that each class has its own covariance structure, the decision boundary will not be linear anymore. Let’s visualize this by creating two density functions that use different covariance matrices.
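As a sketch (the specific covariance matrices below are illustrative choices, not taken from the text), we can generate and plot two classes with different covariance structures:

  library(mvtnorm)
  set.seed(1)

  # two classes with different covariance matrices (illustrative values)
  Sigma1 = matrix(c(1, 0.5, 0.5, 1), 2, 2)
  Sigma2 = matrix(c(1, -0.5, -0.5, 2), 2, 2)

  C1 = rmvnorm(400, mean = c(0.5, -1), sigma = Sigma1)
  C2 = rmvnorm(600, mean = c(-0.5, 0.5), sigma = Sigma2)

  plot(C1, pch = 19, col = "darkorange", xlim = c(-4, 4), ylim = c(-4, 4))
  points(C2, pch = 19, col = "deepskyblue")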

13.5 Quadratic Discriminant Analysis

QDA simply abandons the assumption of a common covariance matrix. Hence, the \(\Sigma_k\)'s are not assumed equal. In this case, the determinant \(|\Sigma_k|\) of each covariance matrix differs across classes and no longer cancels in the comparison. In addition, the MAP decision becomes a quadratic function of the target point \(\mathbf{x}\):

\[\begin{align} & \underset{k}{\arg\max} \,\, \log \big( \pi_k f_k(\mathbf{x}) \big) \\ =& ~\underset{k}{\arg\max} \,\, -\frac{1}{2} \log |\Sigma_k| - \frac{1}{2} (\mathbf{x}- \boldsymbol{\mu}_k)^\text{T}\Sigma_k^{-1} (\mathbf{x}- \boldsymbol{\mu}_k) + \log(\pi_k) \\ =& ~\underset{k}{\arg\max} \,\, \mathbf{x}^\text{T}\mathbf{W}_k \mathbf{x}+ \mathbf{w}_k^\text{T}\mathbf{x}+ b_k \end{align}\]

This leads to a quadratic decision boundary between classes \(k\) and \(l\):

\[\big\{\mathbf{x}: \mathbf{x}^\text{T}(\mathbf{W}_k - \mathbf{W}_l) \mathbf{x}+ \mathbf{x}^\text{T}(\mathbf{w}_k - \mathbf{w}_l) + (b_k - b_l) = 0\big\}.\]

The estimation procedure is also similar:

  • Prior probabilities: \(\widehat{\pi}_k = n_k / n = n^{-1} \sum_{i=1}^n \mathbf{1}\{y_i = k\}\), where \(n_k\) is the number of observations in class \(k\).
  • Centroid: \(\widehat{\boldsymbol{\mu}}_k = n_k^{-1} \sum_{i: \,y_i = k} \mathbf{x}_i\)
  • Sample covariance matrix for each class: \[\widehat \Sigma_k = \frac{1}{n_k-1} \sum_{i : \, y_i = k} (\mathbf{x}_i - \widehat{\boldsymbol{\mu}}_k)(\mathbf{x}_i - \widehat{\boldsymbol{\mu}}_k)^\text{T}\]
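Again as a sketch (not part of the original code), these per-class estimates can be computed directly. Here we reuse `Class1` and `Class2` from Section 13.2; since those data were generated with a common \(\Sigma\), the two estimates should come out similar.

  # per-class sample covariance matrices (denominator n_k - 1)
  Sigma1_hat = cov(Class1)
  Sigma2_hat = cov(Class2)
  Sigma1_hat
  Sigma2_hat

  # or fit QDA directly with MASS
  library(MASS)
  x = rbind(Class1, Class2)
  y = c(rep(1, nrow(Class1)), rep(2, nrow(Class2)))
  qda.fit = qda(x, y)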

13.6 Example: the Hand Written Digit Data

We first sample 100 observations for each digit from both the training and testing sets.

    library(ElemStatLearn)
    # a plot of some samples 
    findRows <- function(zip, n) {
        # Find n (random) rows with zip representing 0,1,2,...,9
        res <- vector(length = 10, mode = "list")
        names(res) <- 0:9
        ind <- zip[, 1]
        for (j in 0:9) {
            res[[j + 1]] <- sample(which(ind == j), n)
        }
        return(res)
    }
    
    set.seed(1)
    
    # find 100 samples for each digit for both the training and testing data
    train.id <- findRows(zip.train, 100)
    train.id = unlist(train.id)
    
    test.id <- findRows(zip.test, 100)
    test.id = unlist(test.id)
    
    X = zip.train[train.id, -1]
    Y = zip.train[train.id, 1]
    dim(X)
## [1] 1000  256
    
    Xtest = zip.test[test.id, -1]
    Ytest = zip.test[test.id, 1]
    dim(Xtest)
## [1] 1000  256

We can then fit LDA and QDA and predict.

    # fit LDA
    library(MASS)
    
    dig.lda = lda(X, Y)
    Ytest.pred = predict(dig.lda, Xtest)$class
    table(Ytest, Ytest.pred)
##      Ytest.pred
## Ytest  0  1  2  3  4  5  6  7  8  9
##     0 92  0  2  2  0  0  1  0  3  0
##     1  0 94  0  0  4  0  2  0  0  0
##     2  2  2 66  7  5  2  4  2 10  0
##     3  2  0  3 75  2  8  0  3  6  1
##     4  0  4  2  1 76  1  3  2  2  9
##     5  2  0  3 10  0 79  0  0  3  3
##     6  0  0  4  1  3  4 86  0  1  1
##     7  0  0  0  2  5  0  0 87  0  6
##     8  2  0  4  5  6  7  1  0 72  3
##     9  0  0  0  1  4  0  0  5  0 90
    mean(Ytest != Ytest.pred)
## [1] 0.183

However, QDA does not work in this case because there are too many parameters to estimate: with only 100 observations per class and \(p = 256\) pixels, each class-wise sample covariance matrix is singular and cannot be inverted.

    dig.qda = qda(X, Y) # error message
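To see the problem explicitly, here is a small check (a sketch, not part of the original code): each class has only 100 observations, so its 256-by-256 sample covariance matrix has rank at most 99 and is therefore singular.

    # rank of the sample covariance matrix for digit 0
    X0 = X[Y == 0, ]
    dim(X0)               # 100 x 256
    qr(cov(X0))$rank      # at most 99, far less than 256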