Chapter 12 Logistic Regression

12.1 Modeling Binary Outcomes

To model binary outcomes using a logistic regression, we use the 0/1 coding of \(Y\) and need to specify its connection with the covariates. Recall that in a linear regression, the outcome is continuous, and we model

\[Y = \beta_0 + \beta_1 X + \epsilon\] However, this does not work for classification since \(Y\) can only be 0 or 1. Instead, we model the probability \(P(Y = 1 | X = \mathbf{x})\): given \(X\), \(Y\) is a Bernoulli random variable, and its success probability is modeled as a function of \(\mathbf{x}\):

\[ P(Y = 1 | X = \mathbf{x}) = \frac{\exp(\mathbf{x}^\text{T}\boldsymbol{\beta})}{1 + \exp(\mathbf{x}^\text{T}\boldsymbol{\beta})}\] Note that although \(\mathbf{x}^\text{T}\boldsymbol{\beta}\) may range from \(-\infty\) to \(\infty\) as \(X\) changes, the probability is still bounded between 0 and 1. This is an example of a Generalized Linear Model: the relationship is still represented through a linear function of \(\mathbf{x}\), namely \(\mathbf{x}^\text{T}\boldsymbol{\beta}\), which is connected to the conditional expectation of \(Y\) through the logit link function. The inverse of this link is the logistic function

\[\eta(a) = \frac{\exp(a)}{1 + \exp(a)}\] Hence, \(P(Y = 1 | X = \mathbf{x}) = \eta(\mathbf{x}^\text{T}\boldsymbol{\beta})\). We can invert this relationship and get

\[\begin{aligned} P(Y = 1 | X = \mathbf{x}) = \eta(\mathbf{x}^\text{T}\boldsymbol{\beta}) &= \frac{\exp(\mathbf{x}^\text{T}\boldsymbol{\beta})}{1 + \exp(\mathbf{x}^\text{T}\boldsymbol{\beta})}\\ 1 - \eta(\mathbf{x}^\text{T}\boldsymbol{\beta}) &= \frac{1}{1 + \exp(\mathbf{x}^\text{T}\boldsymbol{\beta})} \\ \text{Odds} = \frac{\eta(\mathbf{x}^\text{T}\boldsymbol{\beta})}{1-\eta(\mathbf{x}^\text{T}\boldsymbol{\beta})} &= \exp(\mathbf{x}^\text{T}\boldsymbol{\beta})\\ \log(\text{Odds}) &= \mathbf{x}^\text{T}\boldsymbol{\beta} \end{aligned}\]
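
As a quick numerical check, the logistic transformation maps any value of the linear predictor into a probability strictly between 0 and 1, and taking the log odds recovers the linear predictor. Below is a minimal sketch of this; the values of a are arbitrary, and plogis() is the built-in logistic function in base R.

  a = c(-5, -1, 0, 1, 5)      # arbitrary values of the linear predictor
  p = exp(a) / (1 + exp(a))   # eta(a), always strictly between 0 and 1
  p
  plogis(a)                   # the built-in logistic function gives the same values
  log(p / (1 - p))            # the log odds, which recovers a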

Hence, the parameters in a logistic regression are interpreted on the log-odds scale. Let’s look at a concrete example.

12.2 Example: Cleveland Clinic Heart Disease Data

We use the Cleveland Clinic heart disease dataset. The goal is to model and predict the class label indicating whether the patient has heart disease or not. This is indicated by whether the num variable is \(0\) (no presence) or \(>0\) (presence).

  heart = read.csv("data/processed_cleveland.csv")
  heart$Y = as.factor(heart$num > 0)
  table(heart$Y)
## 
## FALSE  TRUE 
##   164   139

Let’s model the probability of heart disease using the Age variable. This can be done using the glm() function, which fits Generalized Linear Models. The syntax of glm() is almost the same as that of a linear model. Note that it is important to specify family = binomial so that a logistic regression is fitted.

  logistic.fit <- glm(Y~age, data = heart, family = binomial)
  summary(logistic.fit)
## 
## Call:
## glm(formula = Y ~ age, family = binomial, data = heart)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.596  -1.073  -0.835   1.173   1.705  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -3.00591    0.75913  -3.960  7.5e-05 ***
## age          0.05199    0.01367   3.803 0.000143 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 417.98  on 302  degrees of freedom
## Residual deviance: 402.54  on 301  degrees of freedom
## AIC: 406.54
## 
## Number of Fisher Scoring iterations: 4

The result is similar to that of a linear regression, with some differences. The parameter estimate of age is 0.05199. It is positive, meaning that a higher age increases the chance of having heart disease. However, this does not mean that being one year older increases that chance by 0.05, because, by our previous formula, the probability is not directly expressed as \(\mathbf{x}^\text{T}\boldsymbol{\beta}\).

This calculation becomes concrete when we predict a new target point. Consider a new subject with Age = 55. What is the predicted probability of heart disease? Based on our formula, we have

\[\beta_0 + \beta_1 X = -3.00591 + 0.05199 \times 55 = -0.14646\] And the estimated probability is

\[ P(Y = 1 | X) = \frac{\exp(\beta_0 + \beta_1 X)}{1 + \exp(\beta_0 + \beta_1 X)} = \frac{\exp(-0.14646)}{1 + \exp(-0.14646)} = 0.4634503\] Hence, the estimated probability for this subject is 46.3%. The same calculation can be done in R. Please note that if you want to predict the probability, you need to specify type = "response"; otherwise, only the linear predictor \(\beta_0 + \beta_1 X\) is returned.

  testdata = data.frame("age" = 55)
  predict(logistic.fit, newdata = testdata)
##          1 
## -0.1466722
  predict(logistic.fit, newdata = testdata, type = "response")
##         1 
## 0.4633975

If we need to make a 0/1 decision about this subject, a natural idea is to check whether the predicted probability is greater than 0.5. In this case, since 0.463 < 0.5, we would predict this subject as 0.
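
In R, this decision is simply a comparison of the predicted probability with the 0.5 cutoff. A minimal sketch, reusing logistic.fit and testdata from above:

  pred.prob = predict(logistic.fit, newdata = testdata, type = "response")
  as.numeric(pred.prob > 0.5)   # returns 0 here, since 0.463 < 0.5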

12.3 Interpretation of the Parameters

Recall that \(\mathbf{x}^\text{T}\boldsymbol{\beta}\) is the log odds; we can further interpret the effect of a single variable. Consider the following two subjects, with an arbitrary age value \(a\):

  • A subject with age \(= a\)
  • A subject with age \(= a + 1\)

Then, if we look at the odds ratio corresponding to these two target points, we have

\[\begin{aligned} \text{Odds Ratio} &= \frac{\text{Odds in Group 2}}{\text{Odds in Group 1}}\\ &= \frac{\exp(\beta_0 + \beta_1 (a+1))}{\exp(\beta_0 + \beta_1 a)}\\ &= \frac{\exp(\beta_0 + \beta_1 a) \times \exp(\beta_1)}{\exp(\beta_0 + \beta_1 a)}\\ &= \exp(\beta_1) \end{aligned}\]

Taking \(\log\) on both sides, we have

\[\log(\text{Odds Ratio}) = \beta_1\]

Hence, the odds ratio between these two subjects (who differ by only one unit of age) can be directly interpreted as the exponential of the age parameter. After taking the log, we can also say that

The parameter \(\beta\) of a variable in a logistic regression represents the log of the odds ratio associated with a one-unit increase of this variable.

Please note that we usually do not need to be explicit about which two subjects this odds ratio compares, because the interpretation of the parameter does not depend on the value \(a\), as long as the two subjects differ by one unit.

Also note that this conclusion holds regardless of the values of the other covariates. In a multivariate model, as long as all other covariates are held the same, the previous derivation remains unchanged.
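
For the age-only model fitted earlier, this interpretation can be verified directly by exponentiating the estimated coefficient. A minimal sketch:

  # odds ratio associated with a one-year increase in age
  exp(coef(logistic.fit)["age"])   # exp(0.05199), roughly 1.053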

12.4 Solving a Logistic Regression

The logistic regression is solved by maximizing the log-likelihood function. Note that the log-likelihood is given by

\[\ell(\boldsymbol{\beta}) = \sum_{i=1}^n \log \, p(y_i | x_i, \boldsymbol{\beta}).\] Using the Bernoulli distribution, and writing \(\eta(\mathbf{x}_i)\) as shorthand for \(\eta(\mathbf{x}_i^\text{T}\boldsymbol{\beta})\), we have

\[\begin{align} \ell(\boldsymbol{\beta}) =& \sum_{i=1}^n \log \left\{ \eta(\mathbf{x}_i)^{y_i} [1-\eta(\mathbf{x}_i)]^{1-y_i} \right\}\\ =& \sum_{i=1}^n y_i \log \frac{\eta(\mathbf{x}_i)}{1-\eta(\mathbf{x}_i)} + \log [1-\eta(\mathbf{x}_i)] \\ =& \sum_{i=1}^n y_i \mathbf{x}_i^\text{T}\boldsymbol{\beta}- \log [ 1 + \exp(\mathbf{x}_i^\text{T}\boldsymbol{\beta})] \end{align}\]

Since this objective function is relatively simple, we can use Newton’s method to maximize it. The gradient is given by

\[\frac{\partial \ell(\boldsymbol{\beta})}{\partial \boldsymbol{\beta}} = \sum_{i=1}^n y_i \mathbf{x}_i - \sum_{i=1}^n \frac{\exp(\mathbf{x}_i^\text{T}\boldsymbol{\beta})}{1 + \exp(\mathbf{x}_i^\text{T}\boldsymbol{\beta})} \, \mathbf{x}_i,\]

and the Hessian matrix is given by

\[\frac{\partial^2 \ell(\boldsymbol{\beta})}{\partial \boldsymbol{\beta}\partial \boldsymbol{\beta}^\text{T}} = - \sum_{i=1}^n \mathbf{x}_i \mathbf{x}_i^\text{T}\eta(\mathbf{x}_i) [1- \eta(\mathbf{x}_i)].\] This leads to the Newton-Raphson update

\[\boldsymbol{\beta}^{\,\text{new}} = \boldsymbol{\beta}^{\,\text{old}} - \left[\frac{\partial^2 \ell(\boldsymbol{\beta})}{\partial \boldsymbol{\beta}\partial \boldsymbol{\beta}^\text{T}}\right]^{-1} \frac{\partial \ell(\boldsymbol{\beta})}{\partial \boldsymbol{\beta}}\]

12.5 Example: South Africa Heart Data

We use the South Africa heart data as a demonstration. The goal is to estimate the probability of chd, the indicator of coronary heart disease.

    library(ElemStatLearn)
    data(SAheart)
    
    heart = SAheart
    heart$famhist = as.numeric(heart$famhist)-1
    n = nrow(heart)
    p = ncol(heart)
    
    heart.full = glm(chd~., data=heart, family=binomial)
    round(summary(heart.full)$coef, dig=3)
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept)   -6.151      1.308  -4.701    0.000
## sbp            0.007      0.006   1.135    0.256
## tobacco        0.079      0.027   2.984    0.003
## ldl            0.174      0.060   2.915    0.004
## adiposity      0.019      0.029   0.635    0.526
## famhist        0.925      0.228   4.061    0.000
## typea          0.040      0.012   3.214    0.001
## obesity       -0.063      0.044  -1.422    0.155
## alcohol        0.000      0.004   0.027    0.978
## age            0.045      0.012   3.728    0.000
    
    # fitted value 
    yhat = (heart.full$fitted.values>0.5)
    table(yhat, SAheart$chd)
##        
## yhat      0   1
##   FALSE 256  77
##   TRUE   46  83

Based on what we learned in class, we can solve this problem ourselves using numerical optimization. Here we will demonstrate an approach that uses the general-purpose solver optim(). First, we write the objective function of the logistic regression for any value of \(\boldsymbol \beta\), \(\mathbf{X}\) and \(\mathbf{y}\).

    # the log-likelihood function of logistic regression 
    my.loglik <- function(b, x, y)
    {
        bm = as.matrix(b)
        xb =  x %*% bm
        # this returns the log-likelihood (to be maximized)
        return(sum(y*xb) - sum(log(1 + exp(xb))))
    }

    # the gradient of the log-likelihood
    my.gradient <- function(b, x, y)
    {
        bm = as.matrix(b) 
        expxb =  exp(x %*% bm)
        return(t(x) %*% (y - expxb/(1+expxb)))
    }

Let’s check the result of this function for some arbitrary \(\boldsymbol \beta\) value.

    # prepare the data matrix, adding a column of 1s for the intercept
    
    x = as.matrix(cbind("intercept" = 1, heart[, 1:9]))
    y = as.matrix(heart[,10])
    
    # check my function
    b = rep(0, ncol(x))
    my.loglik(b, x, y) # scalar
## [1] -320.234
    
    # check the optimal value and the likelihood
    my.loglik(heart.full$coefficients, x, y)
## [1] -236.07

Then we optimize this objective function.

    # Use a general solver to get the optimal value
    # Note that we are doing maximization instead of minimization, 
    # so we need to specify "fnscale" = -1
    optim(b, fn = my.loglik, gr = my.gradient, 
          method = "BFGS", x = x, y = y, control = list("fnscale" = -1))
## $par
##  [1] -6.150733305  0.006504017  0.079376464  0.173923988  0.018586578  0.925372019  0.039595096
##  [8] -0.062909867  0.000121675  0.045225500
## 
## $value
## [1] -236.07
## 
## $counts
## function gradient 
##       74       16 
## 
## $convergence
## [1] 0
## 
## $message
## NULL

This matches our glm() solution. Now, if we do not have a general solver, we can implement the Newton-Raphson method ourselves. We need a function that calculates the Hessian matrix and a function that carries out the optimization updates.
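
The functions my.hessian() and my.logistic() used below are not shown in this chapter. The code below is one possible implementation, written to match the derivation in Section 12.4 and the argument conventions of my.loglik() and my.gradient() above; treat it as a sketch rather than the exact code that produced the output that follows.

    # one possible Hessian of the log-likelihood: - sum_i x_i x_i^T eta_i (1 - eta_i)
    # (y is not needed here but is kept for a consistent interface)
    my.hessian <- function(b, x, y)
    {
        bm = as.matrix(b)
        eta = exp(x %*% bm) / (1 + exp(x %*% bm))
        # diagonal weight matrix with entries eta_i (1 - eta_i)
        W = diag(as.vector(eta * (1 - eta)))
        return(- t(x) %*% W %*% x)
    }

    # Newton-Raphson iterations: beta_new = beta_old - Hessian^{-1} %*% gradient
    my.logistic <- function(b, x, y, tol = 1e-10, maxitr = 20, gr, hess, verbose = FALSE)
    {
        b_new = as.matrix(b)
        
        for (j in 1:maxitr)
        {
            b_old = b_new
            b_new = b_old - solve(hess(b_old, x, y)) %*% gr(b_old, x, y)
            
            if (verbose)
            {
                cat("at iteration ", j, ", current beta is \n", sep = "")
                cat(round(b_new, 3), "\n")
            }
            
            # stop once the change in beta falls below the tolerance
            if (max(abs(b_new - b_old)) < tol) break
        }
        
        return(b_new)
    }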

    # my Newton-Raphson method
    # set up an initial value
    # this is sometimes crucial...
    
    b = rep(0, ncol(x))
    
    mybeta = my.logistic(b, x, y, tol = 1e-10, maxitr = 20, 
                         gr = my.gradient, hess = my.hessian, verbose = TRUE)
## at iteration 1, current beta is 
## -4.032 0.005 0.066 0.133 0.009 0.694 0.024 -0.045 -0.001 0.027
## at iteration 2, current beta is 
## -5.684 0.006 0.077 0.167 0.017 0.884 0.037 -0.061 0 0.041
## at iteration 3, current beta is 
## -6.127 0.007 0.079 0.174 0.019 0.924 0.039 -0.063 0 0.045
## at iteration 4, current beta is 
## -6.151 0.007 0.079 0.174 0.019 0.925 0.04 -0.063 0 0.045
## at iteration 5, current beta is 
## -6.151 0.007 0.079 0.174 0.019 0.925 0.04 -0.063 0 0.045
## at iteration 6, current beta is 
## -6.151 0.007 0.079 0.174 0.019 0.925 0.04 -0.063 0 0.045
## at iteration 7, current beta is 
## -6.151 0.007 0.079 0.174 0.019 0.925 0.04 -0.063 0 0.045
    
    # the parameter value
    mybeta
##                    [,1]
## intercept -6.1507208650
## sbp        0.0065040171
## tobacco    0.0793764457
## ldl        0.1739238981
## adiposity  0.0185865682
## famhist    0.9253704194
## typea      0.0395950250
## obesity   -0.0629098693
## alcohol    0.0001216624
## age        0.0452253496
    # get the standard error estimation 
    mysd = sqrt(diag(solve(-my.hessian(mybeta, x, y))))    

With this solution, we can then obtain the standard errors and the p-values, and check them against the glm() solution.

    # my summary matrix
    round(data.frame("beta" = mybeta, "sd" = mysd, "z" = mybeta/mysd, 
          "pvalue" = 2*(1-pnorm(abs(mybeta/mysd)))), dig=5)
##               beta      sd        z  pvalue
## intercept -6.15072 1.30826 -4.70145 0.00000
## sbp        0.00650 0.00573  1.13500 0.25637
## tobacco    0.07938 0.02660  2.98376 0.00285
## ldl        0.17392 0.05966  2.91517 0.00355
## adiposity  0.01859 0.02929  0.63458 0.52570
## famhist    0.92537 0.22789  4.06053 0.00005
## typea      0.03960 0.01232  3.21382 0.00131
## obesity   -0.06291 0.04425 -1.42176 0.15509
## alcohol    0.00012 0.00448  0.02714 0.97835
## age        0.04523 0.01213  3.72846 0.00019
      
    # check that with the glm fitting 
    round(summary(heart.full)$coef, dig=5)
##             Estimate Std. Error  z value Pr(>|z|)
## (Intercept) -6.15072    1.30826 -4.70145  0.00000
## sbp          0.00650    0.00573  1.13500  0.25637
## tobacco      0.07938    0.02660  2.98376  0.00285
## ldl          0.17392    0.05966  2.91517  0.00355
## adiposity    0.01859    0.02929  0.63458  0.52570
## famhist      0.92537    0.22789  4.06053  0.00005
## typea        0.03960    0.01232  3.21382  0.00131
## obesity     -0.06291    0.04425 -1.42176  0.15509
## alcohol      0.00012    0.00448  0.02714  0.97835
## age          0.04523    0.01213  3.72846  0.00019

12.6 Penalized Logistic Regression

Similar to a linear regression, we can also apply penalties to a logistic regression to address collinearity problems or select variables in a high-dimensional setting. For example, with the Lasso penalty, we minimize the penalized negative log-likelihood

\[- \sum_{i=1}^n \log \, p(y_i | x_i, \boldsymbol{\beta}) + \lambda |\boldsymbol{\beta}|_1\] This can be done using the glmnet package. Specifying family = "binomial" ensures that a logistic regression is used, even if your y is not a factor (but a numerical 0/1 variable).

  library(glmnet)
  lasso.fit = cv.glmnet(x = data.matrix(SAheart[, 1:9]), y = SAheart[,10], 
                        nfold = 10, family = "binomial")
  
  plot(lasso.fit)

The procedure is essentially the same as in a linear regression, and we can obtain the estimated parameters at the selected \(\lambda\) value.

  coef(lasso.fit, s = "lambda.min")
## 10 x 1 sparse Matrix of class "dgCMatrix"
##                       s1
## (Intercept) -6.319899679
## sbp          0.003381401
## tobacco      0.067870752
## ldl          0.137863725
## adiposity    .          
## famhist      0.777513058
## typea        0.026945871
## obesity     -0.008357962
## alcohol      .          
## age          0.042535589