Chapter 10 Logistic Regression
10.1 Modeling Binary Outcomes
To model binary outcomes using a logistic regression, we use the 0/1 coding of \(Y\) and need to specify its connection with the covariates. Recall that in a linear regression, the outcome is continuous, and we set
\[Y = \beta_0 + \beta_1 X + \epsilon\] However, this does not work for classification since \(Y\) can only be 0 or 1. Hence, we instead model the probability \(P(Y = 1 | X = \mathbf{x})\). Given \(X\), \(Y\) is a Bernoulli random variable, and its success probability is modeled as a function of \(X\):
\[ P(Y = 1 | X = \mathbf{x}) = \frac{\exp(\mathbf{x}^\text{T}\boldsymbol{\beta})}{1 + \exp(\mathbf{x}^\text{T}\boldsymbol{\beta})}\] Note that although \(\mathbf{x}^\text{T}\boldsymbol{\beta}\) may range from \(-\infty\) to \(\infty\) as \(X\) changes, the probability remains bounded between 0 and 1. This is an example of a Generalized Linear Model: the relationship is still represented using a linear function of \(\mathbf{x}\), namely \(\mathbf{x}^\text{T}\boldsymbol{\beta}\), which is connected to the conditional expectation of \(Y\) through the logit link function. The inverse of this link is the logistic function:
\[\eta(a) = \frac{\exp(a)}{1 + \exp(a)}\] Hence, \(P(Y = 1 | X = \mathbf{x}) = \eta(\mathbf{x}^\text{T}\boldsymbol{\beta})\). We can invert this relationship and obtain
\[\begin{aligned} P(Y = 1 | X = \mathbf{x}) = \eta(\mathbf{x}^\text{T}\boldsymbol{\beta}) &= \frac{\exp(\mathbf{x}^\text{T}\boldsymbol{\beta})}{1 + \exp(\mathbf{x}^\text{T}\boldsymbol{\beta})}\\ 1 - \eta(\mathbf{x}^\text{T}\boldsymbol{\beta}) &= \frac{1}{1 + \exp(\mathbf{x}^\text{T}\boldsymbol{\beta})} \\ \text{Odds} = \frac{\eta(\mathbf{x}^\text{T}\boldsymbol{\beta})}{1-\eta(\mathbf{x}^\text{T}\boldsymbol{\beta})} &= \exp(\mathbf{x}^\text{T}\boldsymbol{\beta})\\ \log(\text{Odds}) &= \mathbf{x}^\text{T}\boldsymbol{\beta} \end{aligned}\] Hence, the parameters in a logistic regression are interpreted on the log-odds scale. Let’s look at a concrete example.
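As a quick numerical sanity check (a small sketch, separate from the data analysis below), the function \(\eta\) maps any real number into the interval \((0, 1)\):
# the logistic function eta(a) = exp(a) / (1 + exp(a))
eta = function(a) exp(a) / (1 + exp(a))
eta(c(-5, 0, 5))   # approximately 0.0067, 0.5000, 0.9933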
10.2 Example: Cleveland Clinic Heart Disease Data
We use the Cleveland Clinic heart disease dataset. The goal is to model and predict a class label of whether the patient has heart disease or not. This is indicated by whether the num variable is \(0\) (no presence) or \(>0\) (presence).
heart = read.csv("data/processed_cleveland.csv")
heart$Y = as.factor(heart$num > 0)
table(heart$Y)
##
## FALSE TRUE
## 164 139
Let’s model the probability of heart disease using the age variable. This can be done using the glm() function, which stands for Generalized Linear Model. The syntax of glm() is almost the same as that of a linear model. Note that it is important to specify family = binomial so that a logistic regression is fitted.
logistic.fit <- glm(Y~age, data = heart, family = binomial)
summary(logistic.fit)
##
## Call:
## glm(formula = Y ~ age, family = binomial, data = heart)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.00591 0.75913 -3.960 7.5e-05 ***
## age 0.05199 0.01367 3.803 0.000143 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 417.98 on 302 degrees of freedom
## Residual deviance: 402.54 on 301 degrees of freedom
## AIC: 406.54
##
## Number of Fisher Scoring iterations: 4
The result is similar to that of a linear regression, with some differences. The parameter estimate of age is 0.05199. It is positive, meaning that increasing age increases the chance of having heart disease. However, this does not mean that being 1 year older increases the chance by 0.05, since, by our previous formula, the probability is not directly expressed as \(\mathbf{x}^\text{T}\boldsymbol{\beta}\).
We can see this by predicting a new target point. Let’s consider a new subject with age = 55. What is the predicted probability of heart disease? Based on our formula, we have
\[\beta_0 + \beta_1 X = -3.00591 + 0.05199 \times 55 = -0.14646\] And the estimated probability is
\[ P(Y = 1 | X) = \frac{\exp(\beta_0 + \beta_1 X)}{1 + \exp(\beta_0 + \beta_1 X)} = \frac{\exp(-0.14646)}{1 + \exp(-0.14646)} = 0.4634503\]
Hence, the estimated probability for this subject is 46.3%. This can also be done in R. Please note that if you want the predicted probability, you need to specify type = "response". Otherwise, only the linear predictor \(\beta_0 + \beta_1 X\) is returned.
testdata = data.frame("age" = 55)
predict(logistic.fit, newdata = testdata)
## 1
## -0.1466722
predict(logistic.fit, newdata = testdata, type = "response")
## 1
## 0.4633975
If we need to make a 0/1 decision about this subject, a natural idea is to see if the predicted probability is greater than 0.5. In this case, we would predict this subject as 0.
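For example, here is a minimal sketch of this cut-off rule, reusing the prediction above (the 0.5 threshold is a natural default, not a requirement):
# predicted probability for the new subject
prob = predict(logistic.fit, newdata = testdata, type = "response")

# predict class 1 (disease) if the probability exceeds 0.5
as.numeric(prob > 0.5)   # gives 0 here since 0.463 < 0.5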
10.3 Interpretation of the Parameters
Recall that \(\mathbf{x}^\text{T}\boldsymbol{\beta}\) is the log odds; we can use this to further interpret the effect of a single variable. Let’s consider the following two subjects, with an arbitrary age value \(a\):
- A subject with age \(= a\)
- A subject with age \(= a + 1\)
Then, if we look at the odds ratio corresponding to these two target points, we have
\[\begin{aligned} \text{Odds Ratio} &= \frac{\text{Odds in Group 2}}{\text{Odds in Group 1}}\\ &= \frac{\exp(\beta_0 + \beta_1 (a+1))}{\exp(\beta_0 + \beta_1 a)}\\ &= \frac{\exp(\beta_0 + \beta_1 a) \times \exp(\beta_1)}{\exp(\beta_0 + \beta_1 a)}\\ &= \exp(\beta_1) \end{aligned}\]Taking \(\log\) on both sides, we have
\[\log(\text{Odds Ratio}) = \beta_1\]
Hence, the odds ratio between these two subjects (who differ by only one unit of age) is exactly the exponential of the parameter of age. After taking the log, we can also say that
The parameter \(\beta\) of a variable in a logistic regression represents the log of the odds ratio associated with a one-unit increase of this variable.
Please note that we usually do not need to be explicit about which two subjects this odds ratio compares, because the interpretation of the parameter does not change regardless of the value \(a\), as long as the two subjects differ by one unit.
Also note that this conclusion holds regardless of the values of other covariates. When we have a multivariate model, as long as all other covariates are held the same, the previous derivation remains the same.
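As a quick illustration (a sketch using the single-covariate model logistic.fit fitted earlier), the odds ratio for age is simply the exponential of its estimated coefficient:
# odds ratio associated with a one-year increase of age
exp(coef(logistic.fit)["age"])   # exp(0.05199), about 1.053
So each additional year of age multiplies the odds of heart disease by roughly 1.053, i.e., about a 5.3% increase in the odds.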
10.4 Solving a Logistic Regression
The logistic regression is solved by maximizing the log-likelihood function. Note that the log-likelihood is given by
\[\ell(\boldsymbol{\beta}) = \sum_{i=1}^n \log \, p(y_i | x_i, \boldsymbol{\beta}).\] Using the probabilities of the Bernoulli distribution, and writing \(\eta(\mathbf{x}_i)\) as shorthand for \(\eta(\mathbf{x}_i^\text{T}\boldsymbol{\beta})\), we have
\[\begin{align} \ell(\boldsymbol{\beta}) =& \sum_{i=1}^n \log \left\{ \eta(\mathbf{x}_i)^{y_i} [1-\eta(\mathbf{x}_i)]^{1-y_i} \right\}\\ =& \sum_{i=1}^n y_i \log \frac{\eta(\mathbf{x}_i)}{1-\eta(\mathbf{x}_i)} + \log [1-\eta(\mathbf{x}_i)] \\ =& \sum_{i=1}^n y_i \mathbf{x}_i^\text{T}\boldsymbol{\beta}- \log [ 1 + \exp(\mathbf{x}_i^\text{T}\boldsymbol{\beta})] \end{align}\]
Since this objective function is relatively simple, we can use Newton’s method to solve it. The gradient is given by
\[\frac{\partial \ell(\boldsymbol{\beta})}{\partial \boldsymbol{\beta}} = \sum_{i=1}^n y_i \mathbf{x}_i - \sum_{i=1}^n \frac{\exp(\mathbf{x}_i^\text{T}\boldsymbol{\beta}) \, \mathbf{x}_i}{1 + \exp(\mathbf{x}_i^\text{T}\boldsymbol{\beta})},\]
and the Hessian matrix is given by
\[\frac{\partial^2 \ell(\boldsymbol{\beta})}{\partial \boldsymbol{\beta}\partial \boldsymbol{\beta}^\text{T}} =~ - \sum_{i=1}^n \mathbf{x}_i \mathbf{x}_i^\text{T}\eta(\mathbf{x}_i) [1- \eta(\mathbf{x}_i)].\] This leads to the update
\[\boldsymbol{\beta}^{\,\text{new}} = \boldsymbol{\beta}^{\,\text{old}} - \left[\frac{\partial^2 \ell(\boldsymbol{\beta})}{\partial \boldsymbol{\beta}\partial \boldsymbol{\beta}^\text{T}}\right]^{-1} \frac{\partial \ell(\boldsymbol{\beta})}{\partial \boldsymbol{\beta}}\]
10.5 Example: South Africa Heart Data
We use the South Africa heart data as a demonstration. The goal is to estimate the probability of chd, the indicator of coronary heart disease.
library(ElemStatLearn)
data(SAheart)
heart = SAheart
heart$famhist = as.numeric(heart$famhist)-1
n = nrow(heart)
p = ncol(heart)
heart.full = glm(chd~., data=heart, family=binomial)
round(summary(heart.full)$coef, dig=3)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -6.151 1.308 -4.701 0.000
## sbp 0.007 0.006 1.135 0.256
## tobacco 0.079 0.027 2.984 0.003
## ldl 0.174 0.060 2.915 0.004
## adiposity 0.019 0.029 0.635 0.526
## famhist 0.925 0.228 4.061 0.000
## typea 0.040 0.012 3.214 0.001
## obesity -0.063 0.044 -1.422 0.155
## alcohol 0.000 0.004 0.027 0.978
## age 0.045 0.012 3.728 0.000
# fitted value
yhat = (heart.full$fitted.values>0.5)
table(yhat, SAheart$chd)
##
## yhat 0 1
## FALSE 256 77
## TRUE 46 83
Based on what we learned in class, we can solve this problem ourselves using numerical optimization. Here we will demonstrate an approach that uses the general-purpose solver optim(). First, write the objective function of the logistic regression for any value of \(\boldsymbol \beta\), \(\mathbf{X}\) and \(\mathbf{y}\).
# the log-likelihood function of the logistic regression
my.loglik <- function(b, x, y)
{
  bm = as.matrix(b)
  xb = x %*% bm
  # this returns the log-likelihood: sum_i [ y_i x_i'b - log(1 + exp(x_i'b)) ]
  return(sum(y*xb) - sum(log(1 + exp(xb))))
}
# the gradient of the log-likelihood
my.gradient <- function(b, x, y)
{
  bm = as.matrix(b)
  expxb = exp(x %*% bm)
  # X^T (y - eta), where eta = exp(Xb) / (1 + exp(Xb))
  return(t(x) %*% (y - expxb/(1+expxb)))
}
Let’s check the result of this function for some arbitrary \(\boldsymbol \beta\) value.
# prepare the data matrix, I am adding a column of 1 for intercept
x = as.matrix(cbind("intercept" = 1, heart[, 1:9]))
y = as.matrix(heart[,10])
# check my function
b = rep(0, ncol(x))
my.loglik(b, x, y) # scalar
## [1] -320.234
# check the optimal value and the likelihood
my.loglik(heart.full$coefficients, x, y)
## [1] -236.07
Then we optimize this objective function.
# Use a general solver to get the optimal value
# Note that since we are doing maximization instead of minimization,
# we need to specify "fnscale" = -1
optim(b, fn = my.loglik, gr = my.gradient,
      method = "BFGS", x = x, y = y, control = list("fnscale" = -1))
## $par
## [1] -6.150733305 0.006504017 0.079376464 0.173923988 0.018586578 0.925372019 0.039595096
## [8] -0.062909867 0.000121675 0.045225500
##
## $value
## [1] -236.07
##
## $counts
## function gradient
## 74 16
##
## $convergence
## [1] 0
##
## $message
## NULL
This matches our glm() solution. Now, if we do not have a general solver, we can use the Newton-Raphson method. This requires writing a function to calculate the Hessian matrix and then iterating the optimization update; one possible sketch of such functions is given below.
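These helper functions are not shown in the notes, so the following is an assumed implementation: my.hessian computes the Hessian derived above, and my.logistic runs the Newton-Raphson loop. The argument names are chosen only to match the call that follows.
# Hessian of the log-likelihood: - sum_i x_i x_i^T eta_i (1 - eta_i)
my.hessian <- function(b, x, y)
{
  # y is not needed for the Hessian, but kept to match the call below
  bm = as.matrix(b)
  expxb = exp(x %*% bm)
  eta = expxb / (1 + expxb)
  return(- t(x) %*% (x * as.vector(eta * (1 - eta))))
}

# Newton-Raphson iterations using the supplied gradient and Hessian functions
my.logistic <- function(b, x, y, tol = 1e-10, maxitr = 20, gr, hess, verbose = FALSE)
{
  b_new = as.matrix(b)
  for (i in 1:maxitr)
  {
    b_old = b_new
    # update: beta_new = beta_old - [Hessian]^{-1} gradient
    b_new = b_old - solve(hess(b_old, x, y)) %*% gr(b_old, x, y)
    if (verbose)
      cat("at iteration", i, ", current beta is \n", round(b_new, 3), "\n")
    # stop once the update becomes negligible
    if (max(abs(b_new - b_old)) < tol) break
  }
  return(b_new)
}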
# my Newton-Raphson method
# set up an initial value
# this is sometimes crucial...
b = rep(0, ncol(x))
mybeta = my.logistic(b, x, y, tol = 1e-10, maxitr = 20,
gr = my.gradient, hess = my.hessian, verbose = TRUE)
## at iteration 1, current beta is
## -4.032 0.005 0.066 0.133 0.009 0.694 0.024 -0.045 -0.001 0.027
## at iteration 2, current beta is
## -5.684 0.006 0.077 0.167 0.017 0.884 0.037 -0.061 0 0.041
## at iteration 3, current beta is
## -6.127 0.007 0.079 0.174 0.019 0.924 0.039 -0.063 0 0.045
## at iteration 4, current beta is
## -6.151 0.007 0.079 0.174 0.019 0.925 0.04 -0.063 0 0.045
## at iteration 5, current beta is
## -6.151 0.007 0.079 0.174 0.019 0.925 0.04 -0.063 0 0.045
## at iteration 6, current beta is
## -6.151 0.007 0.079 0.174 0.019 0.925 0.04 -0.063 0 0.045
## at iteration 7, current beta is
## -6.151 0.007 0.079 0.174 0.019 0.925 0.04 -0.063 0 0.045
# the parameter value
mybeta
## [,1]
## intercept -6.1507208650
## sbp 0.0065040171
## tobacco 0.0793764457
## ldl 0.1739238981
## adiposity 0.0185865682
## famhist 0.9253704194
## typea 0.0395950250
## obesity -0.0629098693
## alcohol 0.0001216624
## age 0.0452253496
# get the standard error estimates: the inverse of the negative Hessian
# (the observed information matrix) is the variance-covariance matrix
mysd = sqrt(diag(solve(-my.hessian(mybeta, x, y))))
With this solution, I can then obtain the standard errors and the p-values. You can check them against the glm() solution.
# my summary matrix
round(data.frame("beta" = mybeta, "sd" = mysd, "z" = mybeta/mysd,
"pvalue" = 2*(1-pnorm(abs(mybeta/mysd)))), dig=5)
## beta sd z pvalue
## intercept -6.15072 1.30826 -4.70145 0.00000
## sbp 0.00650 0.00573 1.13500 0.25637
## tobacco 0.07938 0.02660 2.98376 0.00285
## ldl 0.17392 0.05966 2.91517 0.00355
## adiposity 0.01859 0.02929 0.63458 0.52570
## famhist 0.92537 0.22789 4.06053 0.00005
## typea 0.03960 0.01232 3.21382 0.00131
## obesity -0.06291 0.04425 -1.42176 0.15509
## alcohol 0.00012 0.00448 0.02714 0.97835
## age 0.04523 0.01213 3.72846 0.00019
# check that with the glm fitting
round(summary(heart.full)$coef, dig=5)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -6.15072 1.30826 -4.70145 0.00000
## sbp 0.00650 0.00573 1.13500 0.25637
## tobacco 0.07938 0.02660 2.98376 0.00285
## ldl 0.17392 0.05966 2.91517 0.00355
## adiposity 0.01859 0.02929 0.63458 0.52570
## famhist 0.92537 0.22789 4.06053 0.00005
## typea 0.03960 0.01232 3.21382 0.00131
## obesity -0.06291 0.04425 -1.42176 0.15509
## alcohol 0.00012 0.00448 0.02714 0.97835
## age 0.04523 0.01213 3.72846 0.00019
10.6 Penalized Logistic Regression
Similar to a linear regression, we can also apply penalties to a logistic regression to address collinearity problems or to select variables in a high-dimensional setting. For example, if we use the Lasso penalty, we minimize the objective function
\[-\sum_{i=1}^n \log \, p(y_i | x_i, \boldsymbol{\beta}) + \lambda |\boldsymbol{\beta}|_1,\]
i.e., the negative log-likelihood plus the \(\ell_1\) penalty on the coefficients.
This can be done using the glmnet package. Specifying family = "binomial" will ensure that a logistic regression is used, even if your y is not a factor (but a numeric 0/1 variable).
library(glmnet)
lasso.fit = cv.glmnet(x = data.matrix(SAheart[, 1:9]), y = SAheart[, 10],
                      nfolds = 10, family = "binomial")
plot(lasso.fit)
The procedure is essentially the same as in a linear regression, and we can obtain the estimated parameters at the selected \(\lambda\) value, as sketched below.
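For instance, a minimal sketch of extracting the coefficients at the cross-validated \(\lambda\) from the cv.glmnet fit:
# coefficients at the lambda that minimizes the cross-validation error
coef(lasso.fit, s = "lambda.min")

# or at the largest lambda within one standard error of the minimum (sparser fit)
coef(lasso.fit, s = "lambda.1se")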