A motivating example. We use the Cleveland
Clinic heart disease dataset from Week 3 again. The goal this time
is to model and predict a class label of whether the patient has
heart disease or not. This is indicated by whether the num
variable is 0 (no presence) or >0 (presence).
heart = read.csv("processed_cleveland.csv")
heart$Y = as.factor(heart$num > 0)
table(heart$Y)
##
## FALSE TRUE
## 164 139
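As a quick sanity check, the counts printed above give the sample prevalence of heart disease (a minimal computation using those two counts):

```r
# Sample prevalence of heart disease, using the counts from table(heart$Y)
n_false <- 164                  # num == 0, no heart disease
n_true <- 139                   # num > 0, heart disease
n_true / (n_false + n_true)     # about 0.459
```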
To model the 1/0 outcome, we need to specify its connection with covariates. Recall that in a linear regression, the outcome is continuous, and we can set
\[Y = \beta_0 + \beta_1 X + \epsilon\] However, this does not work for classification, since \(Y\) can only be 0 or 1. Hence we turn to modeling the probability \(P(Y = 1 | X)\). This notation is interpreted as the probability of \(Y\) being 1 (e.g., the probability of having heart disease). And this is modeled as a function of \(X\):
\[ P(Y = 1 | X) = \frac{\exp(\beta_0 + \beta_1 X)}{1 + \exp(\beta_0 + \beta_1 X)}\] This can be a tricky idea for beginners. However, here are some facts: no matter what value \(\beta_0 + \beta_1 X\) takes, this quantity always lies strictly between 0 and 1, so it is a valid probability; it equals exactly 0.5 when \(\beta_0 + \beta_1 X = 0\); and it is increasing in \(X\) whenever \(\beta_1 > 0\) (and decreasing when \(\beta_1 < 0\)).
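These properties are easy to verify numerically. A minimal sketch in R (the grid of linear-predictor values is arbitrary):

```r
# The logistic transformation maps any linear predictor beta0 + beta1 * X
# into the interval (0, 1), so the result is always a valid probability.
eta <- seq(-6, 6, by = 0.5)        # arbitrary values of beta0 + beta1 * X
p <- exp(eta) / (1 + exp(eta))
all(p > 0 & p < 1)                 # TRUE: always a valid probability
exp(0) / (1 + exp(0))              # equals 0.5 when the linear predictor is 0
```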
Let’s model the probability of heart disease using the age variable. This can be done using the glm() function, which stands for Generalized Linear Model. The syntax of glm() is almost the same as a linear model. Note that it is important to use family = binomial to specify the logistic regression.
logistic.fit <- glm(Y~age, data = heart, family = binomial)
summary(logistic.fit)
##
## Call:
## glm(formula = Y ~ age, family = binomial, data = heart)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.596 -1.073 -0.835 1.173 1.705
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.00591 0.75913 -3.960 7.5e-05 ***
## age 0.05199 0.01367 3.803 0.000143 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 417.98 on 302 degrees of freedom
## Residual deviance: 402.54 on 301 degrees of freedom
## AIC: 406.54
##
## Number of Fisher Scoring iterations: 4
The result is similar to a linear regression, with some differences: the test statistics are z values (based on a normal approximation) rather than t values; instead of \(R^2\) and residual sums of squares, the output reports the null and residual deviance; and the coefficients are on the log-odds scale, so they cannot be read directly as changes in probability.
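For example, the drop from the null deviance to the residual deviance reported above can be turned into a likelihood-ratio test for the age effect. A sketch using the printed values:

```r
# Likelihood-ratio test: the drop in deviance is approximately chi-squared,
# with df equal to the difference in degrees of freedom (302 - 301 = 1).
dev_null <- 417.98     # null deviance (from the summary output above)
dev_resid <- 402.54    # residual deviance
lr_stat <- dev_null - dev_resid                       # 15.44
pchisq(lr_stat, df = 302 - 301, lower.tail = FALSE)   # very small p-value
```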
Let’s consider a new subject with age = 55. What is the predicted probability of heart disease? Based on our formula, we have
\[\beta_0 + \beta_1 X = -3.00591 + 0.05199 \times 55 = -0.14646\] And the estimated probability is
\[ P(Y = 1 | X) = \frac{\exp(\beta_0 +
\beta_1 X)}{1 + \exp(\beta_0 + \beta_1 X)} = \frac{\exp(-0.14646)}{1 +
\exp(-0.14646)} = 0.4634503\] Hence, the estimated probability
for this subject is 46.3%. This can also be done using R code. Please note that if you want to predict the probability, you need to specify type = "response". Otherwise, only \(\beta_0 + \beta_1 X\) is provided.
testdata = data.frame("age" = 55)
predict(logistic.fit, newdata = testdata)
## 1
## -0.1466722
predict(logistic.fit, newdata = testdata, type = "response")
## 1
## 0.4633975
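We can also reproduce these numbers by hand from the coefficients in the summary (a sketch; the tiny discrepancies come from rounding the coefficients to five decimals):

```r
# Hand computation of the linear predictor and the probability,
# using the rounded coefficients from the summary output.
b0 <- -3.00591
b1 <- 0.05199
eta <- b0 + b1 * 55                 # linear predictor, matches predict()
prob <- exp(eta) / (1 + exp(eta))   # matches predict(..., type = "response")
c(eta = eta, prob = prob)
```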
Remark: Although our estimation shows that the predicted probability is 46%, this doesn’t mean that the probability of heart disease for a 55-year-old person in the general population would be around 46%. This is because we do not have data from the entire population. Our data were collected from patients who were referred for coronary arteriography (Detrano et al., 1984). Hence we are analyzing a sub-population of high-risk individuals! The validity of the predicted probability therefore relies heavily on the quality of the data and the study design. The same is true when running a linear regression.
Recall that we learned the Odds Ratio from a contingency table. Let’s consider calculating the odds based on the probability of \(Y = 1\) given \(X\). Let’s denote this as \(p\), then we have
\[ \begin{aligned} p &= \frac{\exp(\beta_0 + \beta_1 X)}{1 + \exp(\beta_0 + \beta_1 X)}\\ 1 - p &= \frac{1}{1 + \exp(\beta_0 + \beta_1 X)} \\ \text{Odds} = \frac{p}{1-p} &= \exp(\beta_0 + \beta_1 X) \end{aligned} \] Now, how about odds ratio? That would require us to consider two groups. Let’s define the following two, with some arbitrary age value \(a\):
Group 1: age \(= a\)
Group 2: age \(= a + 1\)
By the definition of the odds ratio, we would have
\[ \begin{aligned} \text{Odds Ratio} &= \frac{\text{Odds in Group 2}}{\text{Odds in Group 1}}\\ &= \frac{\exp(\beta_0 + \beta_1 (a+1))}{\exp(\beta_0 + \beta_1 a)}\\ &= \frac{\exp(\beta_0 + \beta_1 a) \times \exp(\beta_1)}{\exp(\beta_0 + \beta_1 a)}\\ &= \exp(\beta_1) \end{aligned} \] Taking log on both sides, we have
\[\log(\text{Odds Ratio}) = \beta_1\]
Hence, the odds ratio between these two groups (they differ by only one unit of age) can be directly interpreted as the exponential of the parameter of age.
After taking the log, we can also say that
The parameter \(\beta\) of a variable in a logistic regression represents the log of the odds ratio associated with this variable.
Please note that we usually are not explicit about what this odds ratio is about (which two groups it compares). However, based on our previous calculation, this odds ratio refers to two groups that differ by one unit of the variable we are interested in and are identical in terms of all other variables.
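For the age model fitted earlier, this means the estimated odds ratio per additional year of age is the exponential of the age coefficient (a quick computation using the rounded estimate from the summary):

```r
# Odds ratio for a one-year increase in age
b1 <- 0.05199   # age coefficient from the summary output
exp(b1)         # about 1.053: the odds of heart disease rise ~5.3% per year
```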
Similar to the examples we showed in the regression session, categorical variables are also commonly used in a logistic regression. Furthermore, it is very common to use an interaction term of these variables. The following example considers two binary predictors, sex and fbs (fasting blood sugar > 120), and their interaction.
logistic.fit <- glm(Y~ as.factor(sex) + as.factor(fbs) + as.factor(I(sex*fbs)), data = heart, family = binomial)
summary(logistic.fit)
##
## Call:
## glm(formula = Y ~ as.factor(sex) + as.factor(fbs) + as.factor(I(sex *
## fbs)), family = binomial, data = heart)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.2929 -1.1774 -0.7113 1.0661 1.7310
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.2452 0.2604 -4.783 1.73e-06 ***
## as.factor(sex)1 1.5127 0.3022 5.006 5.57e-07 ***
## as.factor(fbs)1 1.2452 0.6333 1.966 0.0493 *
## as.factor(I(sex * fbs))1 -1.5733 0.7389 -2.129 0.0332 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 417.98 on 302 degrees of freedom
## Residual deviance: 389.44 on 299 degrees of freedom
## AIC: 397.44
##
## Number of Fisher Scoring iterations: 4
How to interpret the results? We have to calculate the values associated with each setting:
If sex = 0 and fbs = 0, then their product is also 0. Hence the predicted link (the \(\beta X\) part) is only the intercept term, -1.2452, and the predicted probability is \(\frac{\exp(-1.2452)}{1 + \exp(-1.2452)} = 0.2235321\).
If sex = 1 and fbs = 0, then their product is 0. The predicted link is -1.2452 + 1.5127 = 0.2675 (intercept + effect of sex), and the predicted probability is \(\frac{\exp(0.2675)}{1 + \exp(0.2675)} = 0.5664791\).
If sex = 0 and fbs = 1, then their product is still 0. The predicted link is -1.2452 + 1.2452 = 0 (intercept + effect of fbs), and the predicted probability is 0.5.
If both sex and fbs are 1, their product is 1. The predicted link is -1.2452 + 1.5127 + 1.2452 - 1.5733 = -0.0606, and the predicted probability is 0.4848546.
We can validate the result using the following code:
newdata = data.frame("sex" = c(0, 1, 0, 1), "fbs" = c(0, 0, 1, 1))
predict(logistic.fit, newdata = newdata, type = "response")
## 1 2 3 4
## 0.2235294 0.5664740 0.5000000 0.4848485
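The same four numbers can be reproduced by hand from the reported coefficients (small differences are again due to coefficient rounding):

```r
# The four linear predictors, built from the coefficients in the summary:
# intercept, + sex effect, + fbs effect, + interaction when both are 1.
b <- c(int = -1.2452, sex = 1.5127, fbs = 1.2452, inter = -1.5733)
eta <- c(b["int"],              # sex = 0, fbs = 0
         b["int"] + b["sex"],   # sex = 1, fbs = 0
         b["int"] + b["fbs"],   # sex = 0, fbs = 1
         sum(b))                # sex = 1, fbs = 1
exp(eta) / (1 + exp(eta))       # matches the predict() output above
```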
For continuous variables and their interactions, we can follow the same logic to calculate the predicted probability. Try the practice question yourself.
O’Brien et al. (2020) is a psychology study that analyzes participants’ behavior of disseminating false claims (pseudoscience). This is formatted as a binary outcome variable, with value 1 meaning disseminating a false claim. Furthermore, the authors considered variables such as Trust (trust in science, ranging from 1 to 5, see section 5.3) and Literacy, a continuous variable that reflects the understanding of scientific methodology (ranging from 0 to 8, see section 5.3); scientific content is a binary variable that indicates whether the article provided to the participant contains scientific information. Read Table 1 and experiment 1 (the results in the first 8 rows), and the reduced model (the four columns on the right-hand side). The column B contains the parameter estimates.
Consider a participant with Trust = 4, Literacy = 6 who is reading material with scientific content. What is the probability that this participant disseminates this false information?