Discussion: Evaluation of Classification Models

When we use a logistic regression, how do we know whether the model fits well? In a linear regression, we can use the R-squared, which measures how much of the variance is explained by the model. There is no criterion in a logistic regression that can be interpreted in the same way, but we can always look at the accuracy of the model's predictions.

The first question we have to ask is: given a new subject, how do we predict whether the outcome is 0 or 1? Our first instinct would be: if the predicted probability is larger than 0.5, predict this subject as 1, and as 0 otherwise. We can then apply this rule to the entire training dataset and see how many subjects are predicted correctly.

  # read the Cleveland heart disease data and define a binary outcome (any disease vs. none)
  heart = read.csv("processed_cleveland.csv")
  heart$Y = as.factor(heart$num > 0)
  # fit a logistic regression of the outcome on age, sex, cholesterol, resting blood pressure, and chest pain type
  logistic.fit <- glm(Y ~ age + sex + chol + trestbps + cp, data = heart, family = binomial)
  summary(logistic.fit)
## 
## Call:
## glm(formula = Y ~ age + sex + chol + trestbps + cp, family = binomial, 
##     data = heart)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2343  -0.8233  -0.2620   0.7760   2.6168  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -12.094704   1.753881  -6.896 5.35e-12 ***
## age           0.055552   0.017668   3.144  0.00167 ** 
## sex           1.935521   0.347511   5.570 2.55e-08 ***
## chol          0.004395   0.002734   1.608  0.10794    
## trestbps      0.020838   0.008269   2.520  0.01174 *  
## cp            1.141678   0.170203   6.708 1.98e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 417.98  on 302  degrees of freedom
## Residual deviance: 306.52  on 297  degrees of freedom
## AIC: 318.52
## 
## Number of Fisher Scoring iterations: 5
  # without supplying new data, predict() returns the fitted probabilities for the training data itself
  pred = predict(logistic.fit, type = "response")
  table("Predicted" = (pred > 0.5), "Outcome" = heart$Y)
##          Outcome
## Predicted FALSE TRUE
##     FALSE   128   32
##     TRUE     36  107

We can see that 128 + 107 = 235 subjects are predicted correctly out of 303 subjects. Hence the accuracy is 235 / 303 = 77.56%.
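
As a quick check, the same accuracy can be computed directly from the fitted probabilities. This is a minimal sketch that reuses the pred vector and heart data created above.

  # overall accuracy of the 0.5-cutoff rule on the training data
  mean((pred > 0.5) == (heart$Y == "TRUE"))   # should equal 235 / 303, about 0.7756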

However, a high accuracy does not always mean that the model is good. We should always check the “baseline” accuracy, which is the marginal frequency of the more common outcome: 164 / 303 = 54.13%. Now, consider a situation where the outcome is a very rare disease, say one that occurs only 10% of the time. Then predicting every subject as 0 would already give 90% accuracy. Can we find criteria that are more robust in that case?
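
The baseline accuracy quoted above can be verified directly from the marginal distribution of the outcome; a short sketch using the heart data loaded earlier:

  # baseline accuracy: always predict the more common class
  # table(heart$Y) gives 164 FALSE and 139 TRUE, so the baseline is 164 / 303
  max(table(heart$Y)) / nrow(heart)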

Sensitivity, Specificity and the ROC Curve

                True 0                  True 1
  Predict 1     False Positive (FP)     True Positive (TP)
  Predict 0     True Negative (TN)      False Negative (FN)
  • Sensitivity (also called “Recall”) is defined as the true positive rate (among the True 1 population, what proportion are correctly identified by the model) \[\text{Sensitivity} = \frac{\text{TP}}{\text{TP} + \text{FN}}\]
  • Specificity is defined as the true negative rate (among the True 0 population, what proportion are correctly identified by the model) \[\text{Specificity} = \frac{\text{TN}}{\text{TN} + \text{FP}}\] Both quantities are computed for our data in the sketch after this list.
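
This sketch uses the confusion table at the 0.5 cutoff; the cell counts (TP = 107, FN = 32, TN = 128, FP = 36) are the ones printed in the table above.

  # confusion table at the 0.5 cutoff (same as the one printed above)
  tab = table("Predicted" = (pred > 0.5), "Outcome" = heart$Y)

  # sensitivity: TP / (TP + FN) = 107 / (107 + 32)
  tab["TRUE", "TRUE"] / sum(tab[, "TRUE"])

  # specificity: TN / (TN + FP) = 128 / (128 + 36)
  tab["FALSE", "FALSE"] / sum(tab[, "FALSE"])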

Now, if we use 0.5 as the cutoff to predict a new subject, we can calculate the corresponding sensitivity and specificity. However, we do not have to use 0.5 (think about the case where 90% of the samples are 0). If we alter the cutoff value, we get a different pair of sensitivity and specificity values for each choice. We can then plot 1 - specificity (the false positive rate) against the sensitivity (the true positive rate) across all cutoff values. This is called the ROC curve, and the ROCR package used below can calculate it automatically. The closer this curve is to the top-left corner, the better the model performs. A common summary measure is the area under the ROC curve (AUC): a value of 1 means perfect separation, while 0.5 is no better than random guessing.
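
To make the construction concrete, here is a minimal hand-rolled sketch: we pick an arbitrary grid of cutoff values and compute the true positive rate and false positive rate at each one. The ROCR code below does the same thing over all possible cutoffs.

  # sweep over a small grid of cutoffs and compute (TPR, FPR) at each one
  cutoffs = seq(0.1, 0.9, by = 0.2)   # an arbitrary grid, just for illustration
  truth = (heart$Y == "TRUE")

  sapply(cutoffs, function(cutval) {
    predicted = (pred > cutval)
    c(cutoff = cutval,
      TPR = sum(predicted & truth) / sum(truth),     # sensitivity
      FPR = sum(predicted & !truth) / sum(!truth))   # 1 - specificity
  })

Plotting FPR on the horizontal axis against TPR on the vertical axis over a fine grid of cutoffs traces out the ROC curve.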

  library(ROCR)
  # create a prediction object from the fitted probabilities and the true outcomes
  roc <- prediction(pred, heart$Y)
  
  # calculates the ROC curve
  perf <- performance(roc,"tpr","fpr")
  plot(perf,colorize=TRUE)

  # this computes the area under the curve
  performance(roc, measure = "auc")@y.values[[1]]
## [1] 0.834664