When we fit a logistic regression, how do we know whether the model fits well? In a linear regression we can use the R-squared, which measures how much of the variance is explained by the model. There is no criterion in logistic regression that can be interpreted in the same way, but we can always look at the accuracy of the model's predictions.
The first question to ask is: given a new subject, how do we predict whether the outcome is 0 or 1? A natural rule is to predict 1 if the predicted probability is larger than 0.5, and 0 otherwise. We can then go through the entire training dataset and see how many subjects are predicted correctly.
heart = read.csv("processed_cleveland.csv")
heart$Y = as.factor(heart$num > 0)
logistic.fit <- glm(Y ~ age + sex + chol + trestbps + cp, data = heart, family = binomial)
summary(logistic.fit)
##
## Call:
## glm(formula = Y ~ age + sex + chol + trestbps + cp, family = binomial,
## data = heart)
##
## Deviance Residuals:
##     Min       1Q   Median       3Q      Max
## -2.2343  -0.8233  -0.2620   0.7760   2.6168
##
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)
## (Intercept) -12.094704   1.753881  -6.896 5.35e-12 ***
## age           0.055552   0.017668   3.144  0.00167 **
## sex           1.935521   0.347511   5.570 2.55e-08 ***
## chol          0.004395   0.002734   1.608  0.10794
## trestbps      0.020838   0.008269   2.520  0.01174 *
## cp            1.141678   0.170203   6.708 1.98e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
##     Null deviance: 417.98  on 302  degrees of freedom
## Residual deviance: 306.52  on 297  degrees of freedom
## AIC: 318.52
##
## Number of Fisher Scoring iterations: 5
# without a newdata argument, predict() returns the fitted probabilities for the training data itself
pred = predict(logistic.fit, type = "response")
table("Predicted" = (pred > 0.5), "Outcome" = heart$Y)
##          Outcome
## Predicted FALSE TRUE
##     FALSE   128   32
##      TRUE    36  107
We can see that 128 + 107 = 235 out of the 303 subjects are predicted correctly, so the accuracy is 235 / 303 = 77.56%.
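The same number can be computed directly from the vector of predicted probabilities; a minimal sketch using the pred vector from above:

# accuracy: proportion of subjects whose 0.5-threshold prediction matches the observed outcome
mean((pred > 0.5) == (heart$Y == "TRUE"))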
However, a high accuracy does not always mean that the model is good. We should always check the “baseline” accuracy obtained by predicting every subject as the majority class, which here is the marginal frequency of the more common outcome: 164 / 303 = 54.13%. Now consider a situation where the outcome is a very rare disease, say one that occurs only 10% of the time. If we simply predicted all subjects as 0, the accuracy would already be 90%. Can we find a criterion that is more robust in such cases?
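The baseline above can be read off the marginal counts of heart$Y; a minimal sketch:

# baseline accuracy: always predict the majority class (here FALSE, i.e. no disease)
max(table(heart$Y)) / nrow(heart)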
| \(\quad\) | True 0 | True 1 |
|---|---|---|
| Predict 1 | False Positive (FP) | True Positive (TP) |
| Predict 0 | True Negative (TN) | False Negative (FN) |
Now, if we use 0.5 as the cut-off to predict a new subject, we can calculate the corresponding sensitivity (the true positive rate, TP / (TP + FN)) and specificity (the true negative rate, TN / (TN + FP)). However, it is not necessary to use 0.5 (think about the case where 90% of the samples are 0). If we alter this cut-off value, we obtain a different pair of sensitivity and specificity values for each choice. We can then plot 1 - specificity (the false positive rate) against the sensitivity (the true positive rate). This is called the ROC curve, and it can be computed automatically, for example with the ROCR package used below. The closer this curve is to the top-left corner, the better the model performs. A common summary measure is the area under the ROC curve (AUC).
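To make the cut-off idea concrete, here is a small sketch (not part of the ROCR analysis below) that computes the sensitivity and specificity by hand for a few cut-off values:

# sensitivity = TP / (TP + FN), specificity = TN / (TN + FP)
truth = (heart$Y == "TRUE")
for (cutoff in c(0.3, 0.5, 0.7)) {
  sens = sum(pred > cutoff & truth) / sum(truth)      # true positive rate
  spec = sum(pred <= cutoff & !truth) / sum(!truth)   # true negative rate
  cat("cutoff =", cutoff, "sensitivity =", round(sens, 3),
      "specificity =", round(spec, 3), "\n")
}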
library(ROCR)
# create a prediction object from the fitted probabilities and the true labels
roc <- prediction(pred, heart$Y)
# calculates the ROC curve: true positive rate against false positive rate
perf <- performance(roc, "tpr", "fpr")
plot(perf, colorize = TRUE)
# this computes the area under the curve
performance(roc, measure = "auc")@y.values[[1]]
## [1] 0.834664
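The AUC also has a probabilistic interpretation: it is the chance that a randomly chosen diseased subject receives a higher predicted probability than a randomly chosen healthy subject (ties counted as one half). A minimal sketch to check this against the ROCR value, again using the pred vector from above:

# compare every diseased/healthy pair of predicted probabilities
p1 = pred[heart$Y == "TRUE"]    # diseased subjects
p0 = pred[heart$Y == "FALSE"]   # healthy subjects
mean(outer(p1, p0, ">") + 0.5 * outer(p1, p0, "=="))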