Two-by-Two Contingency Table

Contingency tables are commonly used to summarize the frequencies of two or more categorical variables. There are many situations where you can use a contingency table. For example,

  • When performing a (prospective) randomized clinical trial with two treatment arms for curing a disease, we may collect the following data. Please note that in this example, the total number in each row can be pre-defined by the researcher.
              Cured   Not Cured   Total
Treatment       830         170    1000
Control         640         360    1000
Total          1470         530    2000
  • In a (retrospective) case control study for detecting cancer-associated genetic markers, we may collect the following data. Note that in this case, the total number in each column is pre-defined by the researcher.
                 Have Cancer   No Cancer   Total
Genotype AA              690         436    1126
Genotype Aa/aa           310         564     874
Total                   1000        1000    2000
  • When evaluating a medical test of Trisomy in an observational study, the following table is observed. From this table, we can calculate the sensitivity, specificity, etc., which can be used to evaluate this medical test.
                Trisomy   No Trisomy   Total
Test Positive        15           68      83
Test Negative        12          879     891
Total                27          947     974
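
As a minimal sketch of the quantities mentioned in the last example, the sensitivity and specificity can be read directly off the Trisomy table with base R. The object names (`trisomy`, `sensitivity`, `specificity`) are illustrative choices and not part of the original analysis.

```r
# Trisomy screening table: rows are test results, columns are the true status
# (byrow = TRUE lets us type the table row by row, as it is printed above)
trisomy = matrix(c(15,  68,
                   12, 879),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(Test  = c("Positive", "Negative"),
                                 Truth = c("Trisomy", "No Trisomy")))

# sensitivity: proportion of true Trisomy cases that test positive
sensitivity = trisomy["Positive", "Trisomy"] / sum(trisomy[, "Trisomy"])

# specificity: proportion of non-cases that test negative
specificity = trisomy["Negative", "No Trisomy"] / sum(trisomy[, "No Trisomy"])

round(c(sensitivity = sensitivity, specificity = specificity), 3)
```

With these counts the sensitivity is 15 / 27 (about 0.556) and the specificity is 879 / 947 (about 0.928).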

In all of these examples, we may be interested in whether the condition (row) is associated with the outcome (column). However, be very careful: the designs of these studies are different, and some quantities cannot be calculated under certain designs. In particular, we are interested in two quantities: the Relative Risk and the Odds Ratio.

Relative Risk

The relative risk is defined as the ratio of the probability of an outcome in an exposed group to the probability of the same outcome in an unexposed group. It is also called the Risk Ratio and is often abbreviated as RR.

            Event   No Event   Total
Exposed     A       B          A+B
Unexposed   C       D          C+D
Total       A+C     B+D        A+B+C+D

The RR can be calculated as

\[\frac{A / (A+B)}{C / (C+D)}\]

If RR is significantly different from 1 (an RR of 1 means the risks in both groups are the same), then we can conclude that the grouping factor is significantly associated with the event. It is very important to know that RR is only valid for a prospective study, meaning that the exposed and unexposed groups are defined first and the events are observed afterwards. Otherwise, for example in a case-control study where the column totals are fixed by the researcher, \(A / (A+B)\) cannot be interpreted as the probability of the event in a given group.
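
The formula can be sketched as a small base R helper. The function name `rr_from_counts` is hypothetical (it is not part of any package), and the counts in the two calls are made up purely to illustrate the arithmetic.

```r
# hypothetical helper: relative risk from the four cell counts A, B, C, D
rr_from_counts = function(A, B, C, D) {
  risk_exposed   = A / (A + B)   # P(event | exposed)
  risk_unexposed = C / (C + D)   # P(event | unexposed)
  risk_exposed / risk_unexposed
}

# made-up counts: 30 of 100 exposed and 15 of 100 unexposed have the event
rr_doubled = rr_from_counts(A = 30, B = 70, C = 15, D = 85)
rr_doubled

# equal risks in both groups give an RR of 1
rr_equal = rr_from_counts(A = 20, B = 80, C = 40, D = 160)
rr_equal
```

The first call returns 2 (the risk in the exposed group is twice that of the unexposed group), and the second returns 1 because both risks are 0.2.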

Let’s use our previous artificial data as an example. The two risks are 830 / 1000 and 640 / 1000, making the RR 0.83 / 0.64 = 1.296875. We will use the R package “epitools”. Note that this is an example where you do not have the original data, but only the summary frequency table. Also, be careful when specifying the table with the matrix() function: by default, R fills the matrix column by column.

  library(epitools)
  
  # we need to specify the data of the first column, then second column
  freqtable = matrix(c(830, 640, 170, 360),nrow = 2, ncol = 2)
  
  # this is for naming the items properly (not a necessary step)
  # the R function will automatically assign them some name if you leave them empty. 
  rownames(freqtable) = c("Treatment", "Control")
  colnames(freqtable) = c("Cured", "Not Cured")  
  
  freqtable
##           Cured Not Cured
## Treatment   830       170
## Control     640       360
  
  # use the risk ratio function
  riskratio(freqtable)
## $data
##           Cured Not Cured Total
## Treatment   830       170  1000
## Control     640       360  1000
## Total      1470       530  2000
## 
## $measure
##                         NA
## risk ratio with 95% C.I. estimate    lower    upper
##                Treatment 1.000000       NA       NA
##                Control   2.117647 1.804627 2.484962
## 
## $p.value
##            NA
## two-sided   midp.exact fisher.exact   chi.square
##   Treatment         NA           NA           NA
##   Control            0 4.597469e-22 6.175089e-22
## 
## $correction
## [1] FALSE
## 
## attr(,"method")
## [1] "Unconditional MLE & normal approximation (Wald) CI"

To understand the results:

  • data simply restates the data, together with the totals for rows and columns.
  • measure provides the risk ratio calculation. The estimate column is the estimated RR, with lower and upper giving the 95% confidence interval. Note that the result uses the first row as the reference group. Hence, the second row, Control, represents the RR of control vs. treatment. The estimated RR is 2.118 with a confidence interval of (1.805, 2.485). Since this interval does not include 1, the risks are significantly different. However, this differs from our own calculation of 1.296875. This is because the function expects the event to be in the second column of the data. Based on the construction of freqtable, the risks (of not being cured) are 360 / 1000 and 170 / 1000, making the risk ratio 2.118.
  • p.value provides the significance from three different tests: mid-p exact, Fisher's exact and chi-square. In our case, since the sample size is very large, they should provide similar results. When the sample size is very small, Fisher's exact test should be used.
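
To see where the numbers in measure come from, the estimate and its interval can be reproduced by hand. This is a minimal sketch that assumes the usual Wald interval on the log scale, with standard error sqrt(1/a1 - 1/n1 + 1/a0 - 1/n0), which is consistent with the "normal approximation (Wald) CI" label in the output. The object names (`tab`, `risk`, `se_log_rr`, `ci`) are illustrative.

```r
# same orientation that riskratio() saw above:
# reference row first (Treatment), event column second (Not Cured)
tab = matrix(c(830, 640, 170, 360), nrow = 2, ncol = 2,
             dimnames = list(c("Treatment", "Control"),
                             c("Cured", "Not Cured")))

n    = rowSums(tab)             # group sizes
risk = tab[, "Not Cured"] / n   # risk of the event in each group

# RR of Control vs. Treatment (second row vs. first row)
rr = unname(risk["Control"] / risk["Treatment"])

# standard error of log(RR) under the Wald normal approximation
se_log_rr = sqrt(sum(1 / tab[, "Not Cured"]) - sum(1 / n))

# 95% interval: build it on the log scale, then transform back
ci = exp(log(rr) + c(-1, 1) * qnorm(0.975) * se_log_rr)

round(c(estimate = rr, lower = ci[1], upper = ci[2]), 3)
```

Rounded to three decimals, this gives 2.118 with an interval of (1.805, 2.485), in agreement with the Control row of the output.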

Now we re-organize freqtable so that it is properly oriented to calculate the quantity we are interested in. The result is now the RR of the event (Cured), and it is the RR of treatment vs. control instead of the other way around. The conclusion about significance does not change, since it is essentially the same data.

  # switch columns
  freqtable = freqtable[, c(2,1)]
  
  # switch rows 
  freqtable = freqtable[c(2,1),]
  
  riskratio(freqtable)
## $data
##           Not Cured Cured Total
## Control         360   640  1000
## Treatment       170   830  1000
## Total           530  1470  2000
## 
## $measure
##                         NA
## risk ratio with 95% C.I. estimate    lower    upper
##                Control   1.000000       NA       NA
##                Treatment 1.296875 1.228342 1.369231
## 
## $p.value
##            NA
## two-sided   midp.exact fisher.exact   chi.square
##   Control           NA           NA           NA
##   Treatment          0 4.597469e-22 6.175089e-22
## 
## $correction
## [1] FALSE
## 
## attr(,"method")
## [1] "Unconditional MLE & normal approximation (Wald) CI"
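
Instead of swapping rows and columns afterwards, the table can also be typed in the desired orientation from the start. The sketch below uses the byrow argument of matrix() so that the counts are entered row by row; the name `freqtable2` is an illustrative choice, and the last lines only re-check the RR with base R rather than calling riskratio().

```r
# rows: reference group (Control) first, then the group of interest (Treatment)
# columns: non-event (Not Cured) first, then the event (Cured)
freqtable2 = matrix(c(360, 640,
                      170, 830),
                    nrow = 2, byrow = TRUE,
                    dimnames = list(c("Control", "Treatment"),
                                    c("Not Cured", "Cured")))
freqtable2

# risk of the event (Cured) in each row, and the resulting RR
risk = freqtable2[, "Cured"] / rowSums(freqtable2)
rr_check = unname(risk["Treatment"] / risk["Control"])
rr_check
```

The last line returns 1.296875, the RR of treatment vs. control for the event Cured.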

Example: Vaccine Effect Size Calculation

A chickenpox outbreak started in an Oregon elementary school in October 2001. Tugwell et al. (2004) investigated students who were at risk of chickenpox before the outbreak and separated the subjects into vaccinated and unvaccinated groups. These data were also used in the CDC Principles of Epidemiology guide. The following data were observed:

               Varicella   Non-case
Vaccinated            18        134
Unvaccinated           3          4

What is the effectiveness of the vaccine? Note that the effectiveness of a vaccine is defined as 1 - RR. Hence, let’s focus on calculating the RR, its confidence interval, and its significance. We first need to enter the table as an R matrix:

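  # matrix() fills column-wise: first column is Non-case, second column is Varicella (the event)
  # the reference group (Unvaccinated) goes in the first row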
  Chickenpox = matrix(c(4, 134, 3, 18),nrow = 2, ncol = 2)
  rownames(Chickenpox) = c("Unvaccinated", "Vaccinated")
  colnames(Chickenpox) = c("Non-case", "Varicella")  

  # use the risk ratio function
  riskratio(Chickenpox)
## Warning in chisq.test(xx, correct = correction): Chi-squared approximation may be incorrect
## $data
##              Non-case Varicella Total
## Unvaccinated        4         3     7
## Vaccinated        134        18   152
## Total             138        21   159
## 
## $measure
##                         NA
## risk ratio with 95% C.I.  estimate    lower     upper
##             Unvaccinated 1.0000000       NA        NA
##             Vaccinated   0.2763158 0.105896 0.7209945
## 
## $p.value
##               NA
## two-sided      midp.exact fisher.exact chi.square
##   Unvaccinated         NA           NA         NA
##   Vaccinated   0.05554613   0.04934509 0.01780275
## 
## $correction
## [1] FALSE
## 
## attr(,"method")
## [1] "Unconditional MLE & normal approximation (Wald) CI"

Hence the estimated RR is 27.6%, with a 95% confidence interval of (10.6%, 72.1%). This is significantly different from 1, with a p-value of 0.049. Note that we use Fisher’s exact test since the sample size is relatively small. The effectiveness of the vaccine is 1 - 27.6% = 72.4%, with a confidence interval of (27.9%, 89.4%).
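
The conversion from RR to vaccine effectiveness can be sketched in base R as follows. As a simplifying assumption, the interval is the Wald interval on the log scale (matching the method label in the output above), and the object names (`cases`, `n`, `ve`, `ve_ci`) are illustrative.

```r
# counts from the chickenpox table
cases = c(Unvaccinated = 3, Vaccinated = 18)    # Varicella cases
n     = c(Unvaccinated = 7, Vaccinated = 152)   # group sizes

# RR of vaccinated vs. unvaccinated
rr = unname((cases["Vaccinated"] / n["Vaccinated"]) /
            (cases["Unvaccinated"] / n["Unvaccinated"]))

# Wald interval for RR, built on the log scale
se_log_rr = sqrt(sum(1 / cases) - sum(1 / n))
rr_ci = exp(log(rr) + c(-1, 1) * qnorm(0.975) * se_log_rr)

# effectiveness is 1 - RR, so the interval endpoints swap places
ve    = 1 - rr
ve_ci = rev(1 - rr_ci)

round(100 * c(VE = ve, lower = ve_ci[1], upper = ve_ci[2]), 1)
```

Because 1 - RR is a decreasing function of RR, the upper limit of the RR interval becomes the lower limit of the effectiveness interval, which is why rev() is needed.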

Practice Questions

  1. Another example presented on the CDC website concerns the incidence of Mycobacterium tuberculosis infection among congregated, HIV-infected prison inmates in South Carolina, United States. We will use the data from the original article instead of the table on the CDC website. From Table 2 in McLaughlin (2003), a total of 233 subjects were included; we use the information regarding the side of the dormitory they lived in:
                  Infected   Not Infected
Right Dormitory         82             36
Left Dormitory          22             93

From this table, replicate their RR (right against left) and its confidence interval.

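  # matrix() fills column-wise: first column is Non-Infected, second column is Infected (the event)
  # the reference group (Left) goes in the first row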
  MT = matrix(c(93, 36, 22, 82),nrow = 2, ncol = 2)
  rownames(MT) = c("Left", "Right")
  colnames(MT) = c("Non-Infected", "Infected")

  # use the risk ratio function
  riskratio(MT)
  2. When we actually have the individual-level data, all we need to do is summarize them into a 2-by-2 table. This can be done using the table() function. Try the RR calculation using the following data. However, be careful about your interpretation of the result, since it depends on the table orientation.
  newdata = data.frame("Infected" = rbinom(100, 1, prob = 0.3), 
                       "Vaccinated" = rbinom(100, 1, prob = 0.5))
  datatable = table("Vaccinated"= newdata$Vaccinated, "Infected" = newdata$Infected)
  riskratio(datatable)
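
One way to make the orientation explicit is to convert the 0/1 codes into labelled factors before calling table(), since the order of the factor levels determines the order of the rows and columns. The sketch below is only an illustration: the seed is arbitrary (added so the simulated data are reproducible), and the names `vacc` and `infected` are hypothetical.

```r
set.seed(1)  # arbitrary seed, only for reproducibility of the simulation

newdata = data.frame("Infected"   = rbinom(100, 1, prob = 0.3),
                     "Vaccinated" = rbinom(100, 1, prob = 0.5))

# the first level becomes the first row / column of the table,
# so list the reference group and the non-event first
vacc     = factor(newdata$Vaccinated, levels = c(0, 1),
                  labels = c("Unvaccinated", "Vaccinated"))
infected = factor(newdata$Infected, levels = c(0, 1),
                  labels = c("Not Infected", "Infected"))

datatable = table("Vaccinated" = vacc, "Infected" = infected)
datatable
```

With this orientation, riskratio(datatable) would report the RR of infection for the vaccinated group relative to the unvaccinated group.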