\(\newcommand{\ci}{\perp\!\!\!\perp}\) \(\newcommand{\cA}{\mathcal{A}}\) \(\newcommand{\cB}{\mathcal{B}}\) \(\newcommand{\cC}{\mathcal{C}}\) \(\newcommand{\cD}{\mathcal{D}}\) \(\newcommand{\cE}{\mathcal{E}}\) \(\newcommand{\cF}{\mathcal{F}}\) \(\newcommand{\cG}{\mathcal{G}}\) \(\newcommand{\cH}{\mathcal{H}}\) \(\newcommand{\cI}{\mathcal{I}}\) \(\newcommand{\cJ}{\mathcal{J}}\) \(\newcommand{\cK}{\mathcal{K}}\) \(\newcommand{\cL}{\mathcal{L}}\) \(\newcommand{\cM}{\mathcal{M}}\) \(\newcommand{\cN}{\mathcal{N}}\) \(\newcommand{\cO}{\mathcal{O}}\) \(\newcommand{\cP}{\mathcal{P}}\) \(\newcommand{\cQ}{\mathcal{Q}}\) \(\newcommand{\cR}{\mathcal{R}}\) \(\newcommand{\cS}{\mathcal{S}}\) \(\newcommand{\cT}{\mathcal{T}}\) \(\newcommand{\cU}{\mathcal{U}}\) \(\newcommand{\cV}{\mathcal{V}}\) \(\newcommand{\cW}{\mathcal{W}}\) \(\newcommand{\cX}{\mathcal{X}}\) \(\newcommand{\cY}{\mathcal{Y}}\) \(\newcommand{\cZ}{\mathcal{Z}}\) \(\newcommand{\bA}{\mathbf{A}}\) \(\newcommand{\bB}{\mathbf{B}}\) \(\newcommand{\bC}{\mathbf{C}}\) \(\newcommand{\bD}{\mathbf{D}}\) \(\newcommand{\bE}{\mathbf{E}}\) \(\newcommand{\bF}{\mathbf{F}}\) \(\newcommand{\bG}{\mathbf{G}}\) \(\newcommand{\bH}{\mathbf{H}}\) \(\newcommand{\bI}{\mathbf{I}}\) \(\newcommand{\bJ}{\mathbf{J}}\) \(\newcommand{\bK}{\mathbf{K}}\) \(\newcommand{\bL}{\mathbf{L}}\) \(\newcommand{\bM}{\mathbf{M}}\) \(\newcommand{\bN}{\mathbf{N}}\) \(\newcommand{\bO}{\mathbf{O}}\) \(\newcommand{\bP}{\mathbf{P}}\) \(\newcommand{\bQ}{\mathbf{Q}}\) \(\newcommand{\bR}{\mathbf{R}}\) \(\newcommand{\bS}{\mathbf{S}}\) \(\newcommand{\bT}{\mathbf{T}}\) \(\newcommand{\bU}{\mathbf{U}}\) \(\newcommand{\bV}{\mathbf{V}}\) \(\newcommand{\bW}{\mathbf{W}}\) \(\newcommand{\bX}{\mathbf{X}}\) \(\newcommand{\bY}{\mathbf{Y}}\) \(\newcommand{\bZ}{\mathbf{Z}}\) \(\newcommand{\ba}{\mathbf{a}}\) \(\newcommand{\bb}{\mathbf{b}}\) \(\newcommand{\bc}{\mathbf{c}}\) \(\newcommand{\bd}{\mathbf{d}}\) \(\newcommand{\be}{\mathbf{e}}\) \(\newcommand{\bg}{\mathbf{g}}\) \(\newcommand{\bh}{\mathbf{h}}\) \(\newcommand{\bi}{\mathbf{i}}\) \(\newcommand{\bj}{\mathbf{j}}\) \(\newcommand{\bk}{\mathbf{k}}\) \(\newcommand{\bl}{\mathbf{l}}\) \(\newcommand{\bm}{\mathbf{m}}\) \(\newcommand{\bn}{\mathbf{n}}\) \(\newcommand{\bo}{\mathbf{o}}\) \(\newcommand{\bp}{\mathbf{p}}\) \(\newcommand{\bq}{\mathbf{q}}\) \(\newcommand{\br}{\mathbf{r}}\) \(\newcommand{\bs}{\mathbf{s}}\) \(\newcommand{\bt}{\mathbf{t}}\) \(\newcommand{\bu}{\mathbf{u}}\) \(\newcommand{\bv}{\mathbf{v}}\) \(\newcommand{\bw}{\mathbf{w}}\) \(\newcommand{\bx}{\mathbf{x}}\) \(\newcommand{\by}{\mathbf{y}}\) \(\newcommand{\bz}{\mathbf{z}}\) \(\newcommand{\balpha}{\boldsymbol{\alpha}}\) \(\newcommand{\bbeta}{\boldsymbol{\beta}}\) \(\newcommand{\btheta}{\boldsymbol{\theta}}\) \(\newcommand{\bpi}{\boldsymbol{\pi}}\) \(\newcommand{\bxi}{\boldsymbol{\xi}}\) \(\newcommand{\bmu}{\boldsymbol{\mu}}\) \(\newcommand{\bepsilon}{\boldsymbol{\epsilon}}\) \(\newcommand{\bzero}{\mathbf{0}}\) \(\newcommand{\T}{\text{T}}\) \(\newcommand{\Trace}{\text{Trace}}\) \(\newcommand{\Cov}{\text{Cov}}\) \(\newcommand{\Var}{\text{Var}}\) \(\newcommand{\E}{\mathbb{E}}\) \(\newcommand{\pr}{\text{Pr}}\) \(\newcommand{\p}{\text{p}}\) \(\newcommand{\Prob}{\text{P}}\) \(\newcommand{\argmin}{\operatorname*{arg\,min}}\) \(\newcommand{\argmax}{\operatorname*{arg\,max}}\)

Overview

As we discussed previously, when the treatment assignment is not independent of the potential outcomes, the difference-in-means estimator can be biased. In this section, we will discuss the propensity score and the inverse probability weighting (IPW) method to adjust for this bias.

Example: Simpson’s Paradox

Suppose we split the population into two subgroups, say those with age \(\leq 50\) and those with age \(> 50\). We are interested in the effect of a new treatment on recovery from a disease. For each group, we perform a randomized trial, but with different treatment probabilities. The treatment is binary, and the outcome is whether the patient recovered from the disease. Hence, the treatment effect is the difference in the probabilities of observing a recovery (coded as 1). We are interested in the average treatment effect (ATE) on the entire population. Here is the data we observed. The numbers in each cell represent the number of recoveries out of the number of patients assigned to that treatment.

Group / Treatment | Treatment 1 | Treatment 0
Age \(\leq 50\) | 18 out of 20 (90%) | 150 out of 180 (83.3%)
Age \(> 50\) | 50 out of 80 (62.5%) | 10 out of 20 (50%)
Pooled | 68 out of 100 (68%) | 160 out of 200 (80%)

We can see that the treatment effect is positive (a higher probability of recovery) in both groups, but negative (worse than control) overall. This is known as Simpson’s paradox. Even though we used a randomized controlled trial in both groups, the difference-in-means estimator is biased. Which assumption from our previous lecture is violated?

  • The random treatment assignment assumption is violated in this case: the assignment is not completely independent of the potential outcomes. In the age \(\leq 50\) group, the potential outcomes are higher (patients are overall more likely to recover), but the treatment assignment is more likely to be 0. In the age \(> 50\) group, the potential outcomes are lower, but the treatment assignment is more likely to be 1.

Although this dependency may not be intentional, it is not uncommon in practice. If we know about it ahead of time, we could estimate the average treatment effect (ATE) separately in each group, within which the chance of treatment assignment remains constant, and then average the group-specific effects weighted by their respective sample sizes. In this case, the estimated ATE would be:

\[ \begin{aligned} \tau &= \frac{200}{300} \times \left(\frac{18}{20} - \frac{150}{180}\right) + \frac{100}{300} \times \left(\frac{50}{80} - \frac{10}{20}\right) \\ & \approx 8.61\% \end{aligned} \]
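
We can verify this arithmetic directly in R; a minimal sketch using only the counts from the table above:

  # Recovery rates by age group and treatment (from the table)
  p1_young <- 18 / 20;  p0_young <- 150 / 180   # age <= 50
  p1_old   <- 50 / 80;  p0_old   <- 10 / 20     # age > 50
  
  # Weight each within-group effect by the group's share of the sample
  (200 * (p1_young - p0_young) + 100 * (p1_old - p0_old)) / 300
## [1] 0.08611111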

A Different View

Let’s consider a different view of the same formula given above. Grouping the two parts associated with treatment 1, we have

\[ \begin{aligned} & \frac{200}{300} \times \frac{18}{20} + \frac{100}{300} \times \frac{50}{80} \\ =& \frac{1}{300} \times \left( \frac{18}{20 / 200} + \frac{50}{80 / 100} \right) \end{aligned} \]

which effectively takes the form of the following estimator:

\[ \frac{1}{n} \left( \sum_{i = 1}^{n_1} \frac{A_i Y_i}{ \widehat \Pr(A_i = 1| \text{Group 1})} + \sum_{i = n_1+1}^{n} \frac{A_i Y_i}{ \widehat \Pr(A_i = 1| \text{Group 2})} \right) \]

Now define the group label as a covariate \(X_i\), where \(X_i = 1\) if the patient is in group 1 and \(X_i = 0\) if the patient is in group 2. Then the above formula can be written as

\[ \frac{1}{n} \sum_{i = 1}^{n} \frac{A_i Y_i}{ \Pr(A_i = 1| X_i )} \]

Similarly, we can write the part associated with treatment 0, and combine the two into an estimator of the ATE:

\[ \widehat\tau_{\text{ipw}} = \frac{1}{n} \sum_{i = 1}^{n} \left( \frac{A_i Y_i}{ \widehat e(X_i) } - \frac{(1 - A_i) Y_i}{ 1 - \widehat e(X_i) } \right) \]

where \(e(X_i) = \Pr(A_i = 1| X_i )\) is the probability of receiving the treatment given the covariate \(X_i\), also known as the propensity score, and \(\widehat e(X_i)\) is its estimate.

Compared with the difference-in-means estimator, the above formula is a weighted average, with each observation weighted by the inverse of the probability of receiving its observed treatment label, so that each weighted sum scales back to the total of \(n\) observations. This technique is known as inverse propensity weighting (IPW). The estimator was originally proposed by Horvitz and Thompson (1952) in survey sampling and was later adopted in causal inference.
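
To see the weighting mechanics concretely, the following sketch applies the IPW formula to the Simpson’s paradox example, using the within-group treatment probabilities (20/200 and 80/100) as the known propensity scores; it recovers the same 8.61% estimate as the stratified calculation:

  e_young <- 20 / 200   # Pr(A = 1) in the age <= 50 group
  e_old   <- 80 / 100   # Pr(A = 1) in the age > 50 group
  
  # Sum of A_i Y_i / e(X_i): recovered treated patients, inverse-weighted
  part1 <- 18 / e_young + 50 / e_old
  # Sum of (1 - A_i) Y_i / (1 - e(X_i)): recovered controls, inverse-weighted
  part0 <- 150 / (1 - e_young) + 10 / (1 - e_old)
  
  (part1 - part0) / 300
## [1] 0.08611111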

Assumptions

Recall that our ultimate goal is to estimate the average treatment effect (ATE), following the potential outcome notation:

\[ \tau = \E[Y_i(1) - Y_i(0)] \]

Our goal here is to establish some properties of the new estimator \(\widehat\tau_\text{ipw}\). To do so, we need some assumptions. Previously, we had two assumptions, independence and SUTVA. It is obvious that the independence assumption is no longer valid here. Besides SUTVA, the following two new assumptions allow us to establish the validity of the IPW estimator.

Unconfoundedness: \(A_i \perp \{Y_i(0), Y_i(1)\} | X_i\)

Essentially this is the same as the independence assumption, but now we condition on the covariate \(X_i\). Why do we call it unconfoundedness? Because conditioning on \(X_i\) eliminates the bias that would arise in our estimation if we failed to condition on some relevant variable; here, conditioning on \(X_i\) is sufficient. The assumption has many different names, such as strong ignorability or conditional independence. There is an even weaker version called ignorability; readers are referred to Rubin (1978) and Ding (2023) for more details. The difference is rather technical and is not the focus of this lecture.

Positivity: The treatment assignment is not deterministic for any value of \(X\), i.e., \(\eta \leq P(A = 1 | X = x) \doteq e(x) \leq 1 - \eta\) for some constant \(\eta > 0\)

This ensures that we have individuals under both treatment labels for each value of \(X\), so that we can compare them. This may seem natural, but when the cardinality of \(X\) is large, it can easily be violated.
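
The numerical consequence of a near-violation is easy to see from the weights themselves; a tiny sketch with made-up propensity scores:

  # When e(x) is close to 0, the inverse weight of a single treated
  # subject can dominate the entire IPW sum
  e <- c(0.5, 0.5, 0.01)
  1 / e   # yields weights 2, 2, and 100: the last subject dominates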

Inverse-propensity Weighting

Because the estimate \(\widehat e(X_i)\) carries additional randomness, which could potentially harm the performance of the estimator, let’s first consider a version with \(e(X_i)\) known (e.g., by design). The estimator then becomes

\[ \widehat\tau_\text{ipw}^\ast = \frac{1}{n} \sum_{i = 1}^{n} \left( \frac{A_i Y_i}{ e(X_i) } - \frac{(1 - A_i) Y_i}{ 1 - e(X_i) } \right) \]

First, we can see that this estimator is an unbiased estimator of the ATE:

\[ \begin{aligned} \E[ \widehat\tau_\text{ipw}^\ast ] =& \E \left[ \frac{A_i Y_i}{ e(X_i) } - \frac{(1 - A_i) Y_i}{ 1 - e(X_i) } \right] \quad \text{by IID} \\ =& \E \left[ \frac{A_i Y_i(1)}{ e(X_i) } - \frac{(1 - A_i) Y_i(0)}{ 1 - e(X_i)} \right] \quad \text{by SUTVA} \\ =& \E \left[ \E \left[ \frac{A_i Y_i(1)}{ e(X_i) } \Biggm| X_i \right] - \E \left[ \frac{(1 - A_i) Y_i(0)}{ 1 - e(X_i)} \Biggm| X_i \right] \right] \\ =& \E \left[ \frac{e(X_i)}{ e(X_i) } \E[ Y_i(1) | X_i ] - \frac{1 - e(X_i)}{1- e(X_i) } \E [Y_i(0) | X_i ] \right] \quad \text{by Unconfoundedness} \\ =& \E \, [ Y_i(1) - Y_i(0) ] \\ \end{aligned} \]
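
A small simulation can illustrate this unbiasedness. Below is a minimal sketch under an assumed data-generating model (a single binary covariate, known propensity scores, and a true ATE of 1); the variable names and settings are illustrative, not from any real study. The oracle IPW estimator centers on the truth, while the difference-in-means estimator does not:

  set.seed(1)
  nsim <- 1000
  n <- 2000
  est_ipw <- est_dim <- rep(NA, nsim)
  
  for (k in 1:nsim) 
  {
    x <- rbinom(n, 1, 0.5)            # binary covariate
    e <- ifelse(x == 1, 0.8, 0.2)     # known (true) propensity score
    a <- rbinom(n, 1, e)              # treatment assignment
    y1 <- 2 * x + 1 + rnorm(n)        # potential outcome Y(1); true ATE = 1
    y0 <- 2 * x + rnorm(n)            # potential outcome Y(0)
    y <- a * y1 + (1 - a) * y0        # observed outcome (SUTVA)
    
    est_ipw[k] <- mean(a * y / e - (1 - a) * y / (1 - e))
    est_dim[k] <- mean(y[a == 1]) - mean(y[a == 0])
  }
  
  # est_ipw averages close to the true ATE of 1, while est_dim is
  # biased (its population value is 2.2 under this design)
  c(ipw = mean(est_ipw), diff_in_means = mean(est_dim))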

Now, to bridge the gap between the known \(e(X_i)\) and the estimated \(\widehat e(X_i)\), we can further assume that

  • \(|Y_i| \leq M\)
  • \(\sup_{x} \left| e(x) - \widehat e(x) \right| \rightarrow 0\)

Then it is easy to see that \(\left| \widehat\tau_\text{ipw}^\ast - \widehat\tau_\text{ipw} \right|\) also converges to 0, at the same rate at which \(\widehat e(x)\) converges to \(e(x)\). Furthermore, \(\widehat\tau_\text{ipw}^\ast\) is an average of iid terms, and thus enjoys the \(\sqrt{n}\) rate of convergence. Hence \(\widehat\tau_\text{ipw}\) is also consistent. 1

Example 1: Direct Implementation

The lalonde dataset in the Matching package is a popular dataset for causal inference. It contains the following variables:

  • treat: a binary variable indicating whether the subject received the treatment
  • age: the age of the subject
  • educ: the education level of the subject
  • black: a binary variable indicating whether the subject is black
  • hisp: a binary variable indicating whether the subject is Hispanic
  • married: a binary variable indicating whether the subject is married
  • nodegr: a binary variable indicating whether the subject has no high school degree
  • re74: the real earnings in 1974
  • re75: the real earnings in 1975
  • re78: the real earnings in 1978

LaLonde (1986) was interested in the causal effect of a job training program on earnings. This is in fact experimental data, meaning that the participants were randomly assigned to the treatment and control groups. However, we can still use this dataset to illustrate the IPW method. If we consider a naive difference-in-means estimator, we would have:

  library(Matching)
## Loading required package: MASS
## ## 
## ##  Matching (Version 4.10-14, Build Date: 2023-09-13)
## ##  See https://www.jsekhon.com for additional documentation.
## ##  Please cite software as:
## ##   Jasjeet S. Sekhon. 2011. ``Multivariate and Propensity Score Matching
## ##   Software with Automated Balance Optimization: The Matching package for R.''
## ##   Journal of Statistical Software, 42(7): 1-52. 
## ##
  data(lalonde)
  head(lalonde)
  
  # Calculate the average earnings for treated and control groups
  avg_treated <- mean(lalonde$re78[lalonde$treat == 1])
  avg_control <- mean(lalonde$re78[lalonde$treat == 0])
  
  # Calculate the difference in means
  ate_diff_means <- avg_treated - avg_control
  
  ate_diff_means
## [1] 1794.343

As we have learned in this lecture, we would like to use the IPW estimator to adjust for potential bias.

  # Fit a logistic regression model for the propensity score
  ps_model <- glm(treat ~ age + educ + black + hisp + married + nodegr + re74 + re75, 
                  family = binomial(link = "logit"), data = lalonde)
  
  # Extract the propensity scores
  lalonde$pscore <- predict(ps_model, type = "response")
  # Inverse probability weights: 1/e for the treated, 1/(1 - e) for the controls
  lalonde$psweight <- ifelse(lalonde$treat == 1, 1 / lalonde$pscore, 1 / (1 - lalonde$pscore))
  
  mean(lalonde$treat*lalonde$re78*lalonde$psweight - (1 - lalonde$treat)*lalonde$re78*lalonde$psweight)
## [1] 1613.135
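
Before relying on these weights, it is good practice to check the positivity assumption empirically: the estimated propensity scores should be bounded away from 0 and 1, and should overlap across the two treatment groups. A quick sketch:

  # Summarize estimated propensity scores within each treatment group
  tapply(lalonde$pscore, lalonde$treat, summary)
  
  # Extreme scores produce extreme weights; inspect the weight distribution
  summary(lalonde$psweight)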

Example 2: Propensity Score Stratification

An interesting fact is that we do not need the full information in \(X\) to satisfy the conditional independence assumption; knowing the propensity score \(e(X)\) is enough. Formally, this means that

\[ \text{If} \quad A_i \perp \{Y_i(0), Y_i(1)\} | X_i \quad \text{then} \quad A_i \perp \{Y_i(0), Y_i(1)\} | e(X_i) \]

The proof is relatively straightforward: one shows that the conditional distribution of \(A_i\) given \(e(X_i)\), \(Y_i(1)\), and \(Y_i(0)\) is the same as the conditional distribution of \(A_i\) given \(e(X_i)\) alone. This leads to a new idea: we can first fit a model to estimate the propensity score, then stratify the data based on the estimated propensity score, estimate the ATE within each stratum, and finally average the stratum-specific ATEs to obtain the overall ATE. This is known as the propensity score stratification method. The benefit compared with the IPW method is that the propensity scores do not need to be very accurate; we only need their rankings to be accurate (though one may argue that this is a difficult task as well).

There is a natural trade-off here caused by the number of strata. If we have too many strata, the variance of the estimator will be large; if we have too few, the bias will be large. The following code shows how to implement the method with the lalonde dataset. We use an ad hoc choice of 7 strata, which ensures a reasonable number of observations in each stratum.

  # Stratify subjects into 7 equal-width bins of the estimated propensity score
  lalonde$pscore_strata <- cut(lalonde$pscore, breaks = 7)
  lalonde$pscore_strata <- as.numeric(lalonde$pscore_strata)
  
  # Estimate the mean outcome within each stratum, separately for
  # the treated and the control groups
  means1_within_strata <- aggregate(re78 ~ pscore_strata, data = lalonde[lalonde$treat == 1,], FUN = mean)
  means0_within_strata <- aggregate(re78 ~ pscore_strata, data = lalonde[lalonde$treat == 0,], FUN = mean)
  
  cbind('strata' = means1_within_strata$pscore_strata, 
        'treated' = means1_within_strata$re78,
        'control' = means0_within_strata$re78,
        'strata size' = table(lalonde$pscore_strata))
##   strata  treated  control strata size
## 1      1 8564.331 6020.404          28
## 2      2 3416.432 6817.987          17
## 3      3 5180.502 4085.232         206
## 4      4 6631.739 4517.497          79
## 5      5 6518.722 5429.714          31
## 6      6 8019.782 3603.864          66
## 7      7 7040.979 7132.533          18
  
  # Average the stratum-specific differences, weighted by stratum size
  sum((means1_within_strata$re78 - means0_within_strata$re78)* table(lalonde$pscore_strata)) / nrow(lalonde)
## [1] 1639.586

Example 3: Matching and the Average Treatment Effect of the Treated

Up to now, we have not really discussed which population of subjects we are estimating the treatment effect for. In the previous examples, we estimated the average treatment effect (ATE) on the entire population. Keep in mind that our observations are sampled randomly from the population, hence we calculate the treatment effect over all subjects, including those who received and those who did not receive the treatment. However, in some cases this may not be the interest of the researchers; instead, we may only be interested in those who actually received the treatment. This is known as the average treatment effect of the treated (ATT). Formally, our target of interest is

\[ \tau_{\text{ATT}} = \E[Y_i(1) - Y_i(0) | A_i = 1] \]

Of course, we could use the same estimator as defined previously but restrict the sample to those who received the treatment. But here let’s introduce a new method called matching. The idea is to match each treated subject with a control subject based on their estimated propensity scores \(\widehat e(X_i)\). The difference within each matched pair can then be viewed as an estimate of the treatment effect given \(e(X_i)\). Recall from the propensity score stratification method that conditioning on \(e(X_i)\) is sufficient to disentangle the treatment effect from the treatment assignment. In general, we can also find multiple matched control subjects for each treated subject to improve stability, and then average the differences to obtain the overall treatment effect. You can see that as we use more matched samples, this becomes similar to the stratification method, in which each stratum contains only a few subjects with similar propensity scores. The following code is a direct implementation of the method with the lalonde dataset.

  matching_control = data.frame(pscore = lalonde$pscore[lalonde$treat == 0], 
                                outcome = lalonde$re78[lalonde$treat == 0])

  matching_treat = data.frame(pscore = lalonde$pscore[lalonde$treat == 1], 
                              outcome = lalonde$re78[lalonde$treat == 1])
  
  ntreat = nrow(matching_treat)
  matched_outcome = rep(NA, ntreat)
  
  M = 5   # number of matched controls per treated subject
  
  # match each subject
  for (i in 1:ntreat) 
  {
    abs_diffs = abs(matching_treat$pscore[i] - matching_control$pscore)
    idmatch = order(abs_diffs)[1:M]
    matched_outcome[i] = mean(matching_control$outcome[idmatch])
  }
  
  mean(matching_treat$outcome - matched_outcome)
## [1] 2246.787

We can see that the estimated ATT is much larger than the ATE. This is because we are not focusing on the entire population, but only on those who received the treatment. In practice, it is possible that the treatment is only accessible to a small portion of the population, and those who use it are likely to have a higher chance of benefiting from the treatment.
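
As a cross-check, the ATT can also be estimated by weighting instead of matching: treated subjects keep weight 1, while each control subject is weighted by \(\widehat e(X_i) / (1 - \widehat e(X_i))\), so that the re-weighted controls mimic the covariate distribution of the treated. Below is a minimal sketch of this standard IPW-style ATT estimator, using the propensity scores fitted earlier:

  # ATT via weighting: controls re-weighted by e / (1 - e)
  treated <- lalonde$treat == 1
  w0 <- lalonde$pscore[!treated] / (1 - lalonde$pscore[!treated])
  
  mean(lalonde$re78[treated]) - weighted.mean(lalonde$re78[!treated], w = w0)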

Reference

Ding, Peng. 2023. “A First Course in Causal Inference.” arXiv Preprint arXiv:2305.18793.
Horvitz, Daniel G., and Donovan J. Thompson. 1952. “A Generalization of Sampling Without Replacement from a Finite Universe.” Journal of the American Statistical Association 47 (260): 663–85.
LaLonde, Robert J. 1986. “Evaluating the Econometric Evaluations of Training Programs with Experimental Data.” The American Economic Review 76 (4): 604–20.
Rubin, Donald B. 1978. “Bayesian Inference for Causal Effects: The Role of Randomization.” The Annals of Statistics, 34–58.

  1. An interesting result we will see in the homework is that \(\widehat\tau_\text{ipw}\) with the estimated propensity score will actually perform better than \(\widehat\tau_\text{ipw}^\ast\), which directly uses the true propensity score. Intuitively, the estimated propensity score accounts for the realized number of subjects that actually received the treatment/control in the observed sample.↩︎