\(\newcommand{\ci}{\perp\!\!\!\perp}\) \(\newcommand{\cA}{\mathcal{A}}\) \(\newcommand{\cB}{\mathcal{B}}\) \(\newcommand{\cC}{\mathcal{C}}\) \(\newcommand{\cD}{\mathcal{D}}\) \(\newcommand{\cE}{\mathcal{E}}\) \(\newcommand{\cF}{\mathcal{F}}\) \(\newcommand{\cG}{\mathcal{G}}\) \(\newcommand{\cH}{\mathcal{H}}\) \(\newcommand{\cI}{\mathcal{I}}\) \(\newcommand{\cJ}{\mathcal{J}}\) \(\newcommand{\cK}{\mathcal{K}}\) \(\newcommand{\cL}{\mathcal{L}}\) \(\newcommand{\cM}{\mathcal{M}}\) \(\newcommand{\cN}{\mathcal{N}}\) \(\newcommand{\cO}{\mathcal{O}}\) \(\newcommand{\cP}{\mathcal{P}}\) \(\newcommand{\cQ}{\mathcal{Q}}\) \(\newcommand{\cR}{\mathcal{R}}\) \(\newcommand{\cS}{\mathcal{S}}\) \(\newcommand{\cT}{\mathcal{T}}\) \(\newcommand{\cU}{\mathcal{U}}\) \(\newcommand{\cV}{\mathcal{V}}\) \(\newcommand{\cW}{\mathcal{W}}\) \(\newcommand{\cX}{\mathcal{X}}\) \(\newcommand{\cY}{\mathcal{Y}}\) \(\newcommand{\cZ}{\mathcal{Z}}\) \(\newcommand{\bA}{\mathbf{A}}\) \(\newcommand{\bB}{\mathbf{B}}\) \(\newcommand{\bC}{\mathbf{C}}\) \(\newcommand{\bD}{\mathbf{D}}\) \(\newcommand{\bE}{\mathbf{E}}\) \(\newcommand{\bF}{\mathbf{F}}\) \(\newcommand{\bG}{\mathbf{G}}\) \(\newcommand{\bH}{\mathbf{H}}\) \(\newcommand{\bI}{\mathbf{I}}\) \(\newcommand{\bJ}{\mathbf{J}}\) \(\newcommand{\bK}{\mathbf{K}}\) \(\newcommand{\bL}{\mathbf{L}}\) \(\newcommand{\bM}{\mathbf{M}}\) \(\newcommand{\bN}{\mathbf{N}}\) \(\newcommand{\bO}{\mathbf{O}}\) \(\newcommand{\bP}{\mathbf{P}}\) \(\newcommand{\bQ}{\mathbf{Q}}\) \(\newcommand{\bR}{\mathbf{R}}\) \(\newcommand{\bS}{\mathbf{S}}\) \(\newcommand{\bT}{\mathbf{T}}\) \(\newcommand{\bU}{\mathbf{U}}\) \(\newcommand{\bV}{\mathbf{V}}\) \(\newcommand{\bW}{\mathbf{W}}\) \(\newcommand{\bX}{\mathbf{X}}\) \(\newcommand{\bY}{\mathbf{Y}}\) \(\newcommand{\bZ}{\mathbf{Z}}\) \(\newcommand{\ba}{\mathbf{a}}\) \(\newcommand{\bb}{\mathbf{b}}\) \(\newcommand{\bc}{\mathbf{c}}\) \(\newcommand{\bd}{\mathbf{d}}\) \(\newcommand{\be}{\mathbf{e}}\) \(\newcommand{\bg}{\mathbf{g}}\) 
\(\newcommand{\bh}{\mathbf{h}}\) \(\newcommand{\bi}{\mathbf{i}}\) \(\newcommand{\bj}{\mathbf{j}}\) \(\newcommand{\bk}{\mathbf{k}}\) \(\newcommand{\bl}{\mathbf{l}}\) \(\newcommand{\bm}{\mathbf{m}}\) \(\newcommand{\bn}{\mathbf{n}}\) \(\newcommand{\bo}{\mathbf{o}}\) \(\newcommand{\bp}{\mathbf{p}}\) \(\newcommand{\bq}{\mathbf{q}}\) \(\newcommand{\br}{\mathbf{r}}\) \(\newcommand{\bs}{\mathbf{s}}\) \(\newcommand{\bt}{\mathbf{t}}\) \(\newcommand{\bu}{\mathbf{u}}\) \(\newcommand{\bv}{\mathbf{v}}\) \(\newcommand{\bw}{\mathbf{w}}\) \(\newcommand{\bx}{\mathbf{x}}\) \(\newcommand{\by}{\mathbf{y}}\) \(\newcommand{\bz}{\mathbf{z}}\) \(\newcommand{\RR}{\mathbb{R}}\) \(\newcommand{\NN}{\mathbb{N}}\) \(\newcommand{\balpha}{\boldsymbol{\alpha}}\) \(\newcommand{\bbeta}{\boldsymbol{\beta}}\) \(\newcommand{\btheta}{\boldsymbol{\theta}}\) \(\newcommand{\hpi}{\widehat{\pi}}\) \(\newcommand{\bpi}{\boldsymbol{\pi}}\) \(\newcommand{\hbpi}{\widehat{\boldsymbol{\pi}}}\) \(\newcommand{\bxi}{\boldsymbol{\xi}}\) \(\newcommand{\bmu}{\boldsymbol{\mu}}\) \(\newcommand{\bepsilon}{\boldsymbol{\epsilon}}\) \(\newcommand{\bzero}{\mathbf{0}}\) \(\newcommand{\T}{\text{T}}\) \(\newcommand{\Trace}{\text{Trace}}\) \(\newcommand{\Cov}{\text{Cov}}\) \(\newcommand{\Var}{\text{Var}}\) \(\newcommand{\E}{\mathbb{E}}\) \(\newcommand{\Pr}{\text{Pr}}\) \(\newcommand{\pr}{\text{pr}}\) \(\newcommand{\pdf}{\text{pdf}}\) \(\newcommand{\P}{\text{P}}\) \(\newcommand{\p}{\text{p}}\) \(\newcommand{\One}{\mathbf{1}}\) \(\newcommand{\argmin}{\operatorname*{arg\,min}}\) \(\newcommand{\argmax}{\operatorname*{arg\,max}}\) \(\newcommand{\dtheta}{\frac{\partial}{\partial\theta} }\) \(\newcommand{\ptheta}{\nabla_\theta}\) \(\newcommand{\alert}[1]{\color{darkorange}{#1}}\) \(\newcommand{\alertr}[1]{\color{red}{#1}}\) \(\newcommand{\alertb}[1]{\color{blue}{#1}}\)

1 Overview

As we discussed previously, when the treatment assignment is not independent of the potential outcomes, the difference-in-means estimator can be biased. In this section, we will discuss the propensity score and the inverse probability weighting (IPW) method to adjust for this bias.

2 Example: Simpson’s Paradox

Suppose we split the population into two subgroups: those with age \(\leq 50\) and those with age \(> 50\). We are interested in the effect of a new treatment on recovery from a disease. For each group, we perform a randomized trial, but with different treatment probabilities. The treatment is binary, and the outcome is whether the patient recovered from the disease. Hence, the treatment effect is the difference in the probabilities of observing a recovery (1). We are interested in the average treatment effect (ATE) on the entire population. Here is the data we observed. The numbers in each cell represent the number of patients who recovered out of the total number of patients in that cell.

Group / Treatment Treatment 1 Treatment 0
Age \(\leq 50\) 18 out of 20 (90%) 150 out of 180 (83.3%)
Age \(> 50\) 50 out of 80 (62.5%) 10 out of 20 (50%)
Pooled 68 out of 100 (68%) 160 out of 200 (80%)

We can see that the treatment effect is positive (a higher probability of recovery) in both groups, but negative (worse than control) overall. This is known as Simpson's paradox. Even though we used a randomized controlled trial in both groups, the difference-in-means estimator is biased. Which assumption from our previous lecture is violated?

  • The random treatment assignment assumption is violated in this case: the assignment is not completely independent of the potential outcomes. In the age \(\leq 50\) group, the potential outcomes are higher (patients are overall more likely to recover), but the treatment assignment is more likely to be 0 (only 20 of 200 were treated). In the age \(> 50\) group, the potential outcomes are lower, but the treatment assignment is more likely to be 1 (80 of 100 were treated).

Although this dependency may not be intentional, it is not uncommon in practice. If we know about it ahead of time, we could estimate the average treatment effect (ATE) separately in each group, where the chance of treatment assignment remains constant, and then average the group-specific effects, weighted by their respective sample sizes. In this case, the estimated ATE would be:

\[ \begin{aligned} \tau &= \frac{200}{300} \times \left(\frac{18}{20} - \frac{150}{180}\right) + \frac{100}{300} \times \left(\frac{50}{80} - \frac{10}{20}\right) \\ & \approx 8.61\% \end{aligned} \]
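The stratified calculation can be reproduced directly in R, with all counts taken from the table:

```r
# Group sizes from the table: 200 subjects with age <= 50, 100 with age > 50
n1 <- 200; n2 <- 100; n <- n1 + n2

# Within-group differences in recovery rates
# (valid because assignment is randomized within each group)
tau1 <- 18/20 - 150/180     # age <= 50
tau2 <- 50/80 - 10/20       # age > 50

# Average of the group-specific effects, weighted by group size
tau <- (n1/n) * tau1 + (n2/n) * tau2
round(100 * tau, 2)         # 8.61 (in percent)
```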

3 A Different View

Let’s consider a different view of the same formula given above, by grouping the two parts associated with treatment 1 as

\[ \begin{aligned} & \frac{200}{300} \times \frac{18}{20} + \frac{100}{300} \times \frac{50}{80} \\ =& \frac{1}{300} \times \left( \frac{18}{20 / 200} + \frac{50}{80 / 100} \right) \end{aligned} \]

which effectively corresponds to the following formula:

\[ \frac{1}{n} \left( \sum_{i = 1}^{n_1} \frac{A_i Y_i}{ \widehat \Pr(A_i = 1| \text{Group 1})} + \sum_{i = n_1+1}^{n} \frac{A_i Y_i}{ \widehat \Pr(A_i = 1| \text{Group 2})} \right) \]

If we further define the group label as a covariate \(X_i\), with \(X_i = 1\) if the patient is in group 1 and \(X_i = 0\) if the patient is in group 2, then the above formula can be written as

\[ \frac{1}{n} \sum_{i = 1}^{n} \frac{A_i Y_i}{ \Pr(A_i = 1| X_i )} \]

Similarly, we can write down the part associated with treatment 0, and combine the two into an estimator of the ATE:

\[ \widehat\tau_{\text{ipw}} = \frac{1}{n} \sum_{i = 1}^{n} \left( \frac{A_i Y_i}{ \widehat e(X_i) } - \frac{(1 - A_i) Y_i}{ 1 - \widehat e(X_i) } \right) \]

where \(e(X_i) = \Pr(A_i = 1| X_i )\) is the probability of receiving treatment given the covariate \(X_i\). \(e(X_i)\) is also known as the propensity score.
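To see that this weighted form agrees with the stratified calculation from Section 2, we can reconstruct the 300 subjects of the toy example and apply the estimator with the known within-group treatment probabilities:

```r
# X = 1 for age <= 50 (group 1), X = 0 for age > 50 (group 2)
X <- c(rep(1, 200), rep(0, 100))
# Treatment labels: 20 treated in group 1, 80 treated in group 2
A <- c(rep(1, 20), rep(0, 180), rep(1, 80), rep(0, 20))
# Recovery outcomes matching the counts in the table
Y <- c(rep(1, 18),  rep(0, 2),    # group 1, treated:  18/20 recover
       rep(1, 150), rep(0, 30),   # group 1, control: 150/180 recover
       rep(1, 50),  rep(0, 30),   # group 2, treated:  50/80 recover
       rep(1, 10),  rep(0, 10))   # group 2, control:  10/20 recover

# Known (by design) propensity scores: e(1) = 20/200, e(0) = 80/100
e <- ifelse(X == 1, 0.1, 0.8)

tau_ipw <- mean(A * Y / e - (1 - A) * Y / (1 - e))
round(100 * tau_ipw, 2)   # 8.61, matching the stratified estimate
```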

Compared with the difference-in-means estimator, the above formula is a weighted average, with each observation weighted by the inverse of the probability of receiving its observed treatment label, so that each treatment group scales back up to the full \(n\) observations. This technique is known as inverse propensity weighting (IPW). The estimator was originally proposed by Horvitz and Thompson (1952) in survey sampling and later adopted in causal inference.

4 Assumptions

Recall that our ultimate goal is to estimate the average treatment effect (ATE), following the potential outcome notation:

\[ \tau = \E[Y_i(1) - Y_i(0)] \]

Our goal here is to establish some properties of the new estimator \(\widehat\tau_{\text{ipw}}\). To do so, we need some assumptions. Previously, we had two assumptions: independence and SUTVA. The independence assumption clearly no longer holds in settings like the example above. But, in addition to SUTVA, two new assumptions allow us to establish the validity of the IPW estimator.

Unconfoundedness: \(A_i \perp \{Y_i(0), Y_i(1)\} | X_i\)

Essentially, this is the same as the independence assumption, but now we condition on the covariate \(X_i\). Why do we call it unconfoundedness? Because it eliminates the potential bias that would arise in our estimation if we did not condition on some variable; here, conditioning on \(X_i\) is sufficient. The assumption goes by many names, such as strong ignorability or conditional independence. There is an even weaker version called ignorability; readers are referred to (Rubin 1978) and (Ding 2023) for more details. The difference is rather technical and is not the focus of this lecture.

Positivity: The treatment assignment is not deterministic for any value of \(X\), i.e., there exists some \(\eta > 0\) such that \(\eta \leq \Pr(A = 1 | X = x) \doteq e(x) \leq 1 - \eta\) for all \(x\)

This ensures that we have individuals with both treatment labels at each value of \(X\), so that we can compare them. The assumption may seem natural, but when the cardinality of \(X\) is large, it can easily be violated.
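A small simulation (our own construction, not part of the lecture data) illustrates what goes wrong as the bound \(\eta\) shrinks: observations with \(e(x)\) near 0 or 1 receive enormous weights, which inflates the variance of the IPW estimator.

```r
set.seed(1)
# Monte Carlo standard deviation of the IPW estimator when the propensity
# score is Uniform(eta, 1 - eta); the true ATE is 1 in both scenarios
ipw_sd <- function(eta, nrep = 1000, n = 500) {
  est <- replicate(nrep, {
    e <- runif(n, eta, 1 - eta)   # propensity bounded away from 0 and 1 by eta
    A <- rbinom(n, 1, e)
    Y <- rnorm(n, mean = A)       # Y(1) - Y(0) has mean 1
    mean(A * Y / e - (1 - A) * Y / (1 - e))
  })
  sd(est)
}
c(weak_positivity = ipw_sd(0.01), strong_positivity = ipw_sd(0.2))
# The estimator is noticeably more variable when eta is small
```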

5 Inverse-propensity Weighting

Because estimating \(e(X_i)\) introduces additional randomness, which could potentially harm the performance of the estimator, let's first consider a version with \(e(X_i)\) known (e.g., by design). The estimator then becomes

\[ \widehat\tau_\text{ipw}^\ast = \frac{1}{n} \sum_{i = 1}^{n} \left( \frac{A_i Y_i}{ e(X_i) } - \frac{(1 - A_i) Y_i}{ 1 - e(X_i) } \right) \]

First, we can see that this estimator is an unbiased estimator of the ATE:

\[ \begin{aligned} \E[ \widehat\tau_\text{ipw}^\ast ] =& \E \left[ \frac{A_i Y_i}{ e(X_i) } - \frac{(1 - A_i) Y_i}{ 1 - e(X_i) } \right] \quad \text{by IID} \\ =& \E \left[ \frac{A_i Y_i(1)}{ e(X_i) } - \frac{(1 - A_i) Y_i(0)}{ 1 - e(X_i)} \right] \quad \text{by SUTVA} \\ =& \E \left[ \E \left[ \frac{A_i Y_i(1)}{ e(X_i) } \Biggm| X_i \right] - \E \left[ \frac{(1 - A_i) Y_i(0)}{ 1 - e(X_i)} \Biggm| X_i \right] \right] \\ =& \E \left[ \frac{e(X_i)}{ e(X_i) } \E[ Y_i(1) | X_i ] - \frac{1 - e(X_i)}{1- e(X_i) } \E [Y_i(0) | X_i ] \right] \quad \text{by Unconfoundedness} \\ =& \E \, [ Y_i(1) - Y_i(0) ] \\ \end{aligned} \]
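The unbiasedness can also be checked numerically. The following sketch uses a setup of our own choosing (binary \(X\), propensity score known by design, true ATE equal to 1) and verifies that \(\widehat\tau_\text{ipw}^\ast\) averages out to the truth over many replications:

```r
set.seed(42)
one_run <- function(n = 1000) {
  X <- rbinom(n, 1, 0.5)              # binary covariate
  e <- ifelse(X == 1, 0.8, 0.2)       # known propensity score
  A <- rbinom(n, 1, e)                # treatment depends on X only
  Y <- A + X + rnorm(n)               # observed outcome; true ATE = 1
  mean(A * Y / e - (1 - A) * Y / (1 - e))
}
mean(replicate(2000, one_run()))      # close to the true ATE of 1
```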

Now, to bridge the gap between the known \(e(X_i)\) and the estimated \(\widehat e(X_i)\), we can further assume that

  • \(|Y_i| \leq M\)
  • \(\sup_{x} \left| e(x) - \widehat e(x) \right| \rightarrow 0\)

Then it is easy to see that \(\left| \widehat\tau_\text{ipw}^\ast - \widehat\tau_\text{ipw} \right|\) also converges to 0, at the same rate at which \(\widehat e(x)\) converges to \(e(x)\). Furthermore, \(\widehat\tau_\text{ipw}^\ast\) is an average of iid terms, and hence enjoys a \(\sqrt{n}\) rate of convergence. Therefore \(\widehat\tau_\text{ipw}\) is also consistent.1
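To make this concrete, here is a crude bound. Once \(\sup_x |\widehat e(x) - e(x)| < \eta/2\) (which happens for \(n\) large enough, so that \(\widehat e\) is also bounded within \([\eta/2, 1 - \eta/2]\)), positivity and \(|Y_i| \leq M\) give

\[ \begin{aligned} \left| \widehat\tau_\text{ipw} - \widehat\tau_\text{ipw}^\ast \right| &\leq \frac{1}{n} \sum_{i = 1}^{n} |A_i Y_i| \left| \frac{e(X_i) - \widehat e(X_i)}{\widehat e(X_i) \, e(X_i)} \right| + \frac{1}{n} \sum_{i = 1}^{n} |(1 - A_i) Y_i| \left| \frac{\widehat e(X_i) - e(X_i)}{(1 - \widehat e(X_i))(1 - e(X_i))} \right| \\ &\leq \frac{4M}{\eta^2} \sup_x \left| \widehat e(x) - e(x) \right|, \end{aligned} \]

which converges to 0 at the rate of \(\sup_x |\widehat e(x) - e(x)|\).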

6 The Density Ratio View

The idea of propensity score weighting is essentially to manipulate the distributions of the covariate \(X\) in the treated and control groups so that they resemble the overall distribution of \(X\). To see this, consider having two samples:

  • Treated group: \(\{X_i\}_{i: A_i = 1}\), with density \(f_1(x)\)
  • Control group: \(\{X_i\}_{i: A_i = 0}\), with density \(f_0(x)\)

Then the overall distribution of \(X\) can be written as a mixture of the two groups:

\[ f(x) = \pi_1 f_1(x) + \pi_0 f_0(x) \]

where \(\pi_1\) and \(\pi_0\) are the proportions of treated and control subjects in the overall population. Note that using the Bayes identity, we have

\[ \begin{aligned} f_1(x) &= f(x | A = 1) = \frac{e(x) f(x)}{\Pr(A = 1)} \\ \rightarrow \quad f(x) &= \frac{\Pr(A = 1)}{e(x)} f_1(x) \end{aligned} \]

Now, if we look at the expected potential outcome under treatment for the entire population, we have

\[ \begin{aligned} E_{f}[Y(1)] &= \int E[Y(1) | X = x] f(x) dx \\ &= \int E[Y(1) | X = x] \frac{\Pr(A = 1)}{e(x)} f(x | A = 1) dx \\ &= \E_{X \sim f_1} \left[ E[Y(1) | X] \frac{\Pr(A = 1)}{e(X)} \right] \\ &= \E_{X \sim f_1} \left[ E[Y(1) | X, A = 1] \frac{\Pr(A = 1)}{e(X)} \right] \quad \text{by Unconfoundedness} \\ &= \E_{X \sim f_1} \left[ E[Y | X, A = 1] \frac{\Pr(A = 1)}{e(X)} \right] \quad \text{by SUTVA} \end{aligned} \]

Hence, since the sample proportion of treated subjects, \(n_1/n\), estimates \(\Pr(A = 1)\), a natural estimator of \(E_f[Y(1)]\) is

\[ \widehat E_f[Y(1)] = \frac{1}{n} \sum_{i = 1}^{n} \frac{A_i Y_i}{e(X_i)} \]

This means that we are reweighting the treated group by the density ratio \(\Pr(A = 1) / e(X)\) to recover the overall average potential outcome under treatment. Similarly, we can reweight the control group by \(\Pr(A = 0) / (1 - e(X))\) to recover the overall average potential outcome under control. Combining the two, we arrive at the IPW estimator again.2
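This reweighting can be visualized with a small simulation (our own setup, not from the lecture): the treated group over-represents large values of \(X\), but reweighting by \(\widehat\Pr(A = 1)/e(X)\) recovers the overall mean of \(X\).

```r
set.seed(7)
n <- 1e5
X <- rnorm(n)
e <- plogis(X)                  # propensity score increasing in X
A <- rbinom(n, 1, e)
pA1 <- mean(A)                  # estimate of Pr(A = 1)

c(overall    = mean(X),                               # target: E_f[X]
  treated    = mean(X[A == 1]),                       # biased upward
  reweighted = mean(pA1 / e[A == 1] * X[A == 1]))     # density-ratio corrected
```

The unweighted treated mean is clearly shifted away from the overall mean, while the density-ratio-weighted version recovers it.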

7 Examples

7.1 Example 1: Direct Implementation

The lalonde dataset in the Matching package is a popular dataset for causal inference. It contains the following variables:

  • treat: a binary variable indicating whether the subject received the treatment
  • age: the age of the subject
  • educ: the education level of the subject
  • black: a binary variable indicating whether the subject is black
  • hisp: a binary variable indicating whether the subject is Hispanic
  • married: a binary variable indicating whether the subject is married
  • nodegr: a binary variable indicating whether the subject does not have a high school degree
  • re74: the real earnings in 1974
  • re75: the real earnings in 1975
  • re78: the real earnings in 1978

LaLonde (1986) was interested in the causal effect of a job training program on earnings. This is in fact experimental data, meaning that the participants were randomly assigned to the treatment and control groups. However, we can still use this dataset to illustrate the IPW method. If we consider a naive difference-in-means estimator, we have:

  library(Matching)
  data(lalonde)
  head(lalonde)
  
  # Calculate the average earnings for treated and control groups
  avg_treated <- mean(lalonde$re78[lalonde$treat == 1])
  avg_control <- mean(lalonde$re78[lalonde$treat == 0])
  
  # Calculate the difference in means
  ate_diff_means <- avg_treated - avg_control
  
  ate_diff_means
## [1] 1794.343

As we have learned in this lecture, we would like to use the IPW estimator to adjust for potential bias.

  ps_model <- glm(treat ~ age + educ + black + hisp + married + nodegr + re74 + re75, 
                  family = binomial(link = "logit"), data = lalonde)
  
  # Extract the propensity scores
  lalonde$pscore <- predict(ps_model, type = "response")
  lalonde$psweight <- ifelse(lalonde$treat == 1, 1 / lalonde$pscore, 1 / (1 - lalonde$pscore))
  
  mean(lalonde$treat*lalonde$re78*lalonde$psweight - (1 - lalonde$treat)*lalonde$re78*lalonde$psweight)
## [1] 1613.135

7.2 Example 2: Propensity Score Stratification

An interesting fact is that we do not need the full information in \(X\) to satisfy the conditional independence assumption; it suffices to condition on the propensity score \(e(X)\). Formally, this means that

\[ \text{If} \quad A_i \perp \{Y_i(0), Y_i(1)\} | X_i \quad \text{then} \quad A_i \perp \{Y_i(0), Y_i(1)\} | e(X_i) \]

The proof is relatively straightforward: one shows that the conditional distribution of \(A_i\) given \(e(X_i)\), \(Y_i(1)\), and \(Y_i(0)\) is the same as the conditional distribution of \(A_i\) given \(e(X_i)\) alone. This leads to a new idea: we can first fit a model to estimate the propensity score, then stratify the data based on the estimated propensity score, estimate the ATE within each stratum, and finally average the stratum-specific ATEs to obtain the overall ATE. This is known as the propensity score stratification method. The benefit, compared with the IPW method, is that the propensity scores do not need to be very accurate; only their rankings need to be accurate (though you may argue that this is itself a difficult task).
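For completeness, here is the key step of that proof. Since \(e(X_i)\) is a function of \(X_i\), the tower property and unconfoundedness give

\[ \begin{aligned} \Pr(A_i = 1 \mid e(X_i), Y_i(0), Y_i(1)) &= \E\big[ \E[ A_i \mid X_i, Y_i(0), Y_i(1) ] \,\big|\, e(X_i), Y_i(0), Y_i(1) \big] \\ &= \E\big[ e(X_i) \,\big|\, e(X_i), Y_i(0), Y_i(1) \big] \quad \text{by Unconfoundedness} \\ &= e(X_i) = \Pr(A_i = 1 \mid e(X_i)), \end{aligned} \]

so further conditioning on the potential outcomes does not change the conditional distribution of \(A_i\) given \(e(X_i)\), which is exactly the claimed conditional independence.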

There is a natural trade-off in the number of strata: with too many strata, the variance of the estimator will be large; with too few, its bias will be large. The following code implements the method on the lalonde dataset. We use an ad hoc choice of 7 strata, which ensures a reasonable number of observations in each stratum.

  lalonde$pscore_strata <- cut(lalonde$pscore, breaks = 7)
  lalonde$pscore_strata <- as.numeric(lalonde$pscore_strata)
  
  # Estimate the ATE within each stratum (mean difference between treated and control)
  ate_strata <- sapply(split(lalonde, lalonde$pscore_strata),
                       function(d) mean(d$re78[d$treat == 1]) - mean(d$re78[d$treat == 0]))
  
  means1_within_strata <- aggregate(re78 ~ pscore_strata, data = lalonde[lalonde$treat == 1,], FUN = mean)
  means0_within_strata <- aggregate(re78 ~ pscore_strata, data = lalonde[lalonde$treat == 0,], FUN = mean)
  
  cbind('strata' = means1_within_strata$pscore_strata, 
        'treated' = means1_within_strata$re78,
        'control' = means0_within_strata$re78,
        'strata size' = table(lalonde$pscore_strata))
##   strata  treated  control strata size
## 1      1 8564.331 6020.404          28
## 2      2 3416.432 6817.987          17
## 3      3 5180.502 4085.232         206
## 4      4 6631.739 4517.497          79
## 5      5 6518.722 5429.714          31
## 6      6 8019.782 3603.864          66
## 7      7 7040.979 7132.533          18
  
  sum((means1_within_strata$re78 - means0_within_strata$re78)* table(lalonde$pscore_strata)) / nrow(lalonde)
## [1] 1639.586

7.3 Example 3: Matching and the Average Treatment Effect of the Treated

Up to now, we have not explicitly discussed the population over which the treatment effect is being defined. In the previous examples, we implicitly estimated the average treatment effect (ATE) for the entire population that generated our observed sample of covariates \(X_i\)’s. In that setting, the sample is assumed to be randomly drawn from a single population that includes both treated and untreated individuals. However, in some studies this assumption does not hold, and the researcher’s interest may lie in a more specific population. For example, treated subjects may come from a particular hospital, while control subjects are obtained from an external public database. In such cases, the goal is often to estimate the treatment effect for the treated-source population, i.e., the population represented by patients in that hospital, rather than for the combined population. This is known as the average treatment effect of the treated (ATT). Formally, our target of interest is (for whatever implicit population that \(A_i = 1\) refers to)

\[ \tau_{\text{ATT}} = \E[Y_i(1) - Y_i(0) | A_i = 1] \]

Note that we can further condition on the propensity score \(e(X_i)\) to get independence between the treatment assignment \(A\) and the potential outcomes:

\[ \begin{aligned} \tau_{\text{ATT}} &= \E[ \E[Y_i(1) - Y_i(0) | e(X_i), A_i = 1] | A_i = 1] \\ &= \E[ \E[Y_i(1) | e(X_i), A_i = 1] - \E[Y_i(0) | e(X_i), A_i = 1] | A_i = 1] \\ &= \E[ \E[Y_i(1) | e(X_i)] - \E[Y_i(0) | e(X_i)] | A_i = 1] \quad \text{(by unconfoundedness)} \\ &= \E[ \E[Y_i(1) | e(X_i), A_i = 1] - \E[Y_i(0) | e(X_i), A_i = 0] | A_i = 1] \\ &= \E[ \E[Y_i | e(X_i), A_i = 1] - \E[Y_i | e(X_i), A_i = 0] | A_i = 1] \quad \text{(by SUTVA)} \\ \end{aligned} \]

Note that the first term can be estimated by directly averaging over all treated subjects, while the second term, for subject \(i\), can be estimated by (locally) averaging all control subjects with the same propensity score \(e(X_i)\). That is, the estimator of the ATT is

\[ \widehat\tau_{\text{ATT}} = \frac{1}{n_T} \sum_{i: A_i = 1} \left( Y_i - \widehat E[Y_i | e(X_i), A_i = 0] \right) \]

We can use the same stratification idea: bin the propensity scores and treat the subjects within each stratum as having the same propensity score3. Alternatively, we can match each treated subject with a set of control subjects, by finding the control observation (or the set of \(M\) control observations) whose propensity score is closest to \(e(X_i)\) of that treated subject. The difference within each matched pair can then be viewed as an estimate of the treatment effect given \(e(X_i)\). The following code implements this idea on the lalonde dataset.

  matching_control = data.frame(pscore = lalonde$pscore[lalonde$treat == 0], 
                                outcome = lalonde$re78[lalonde$treat == 0])

  matching_treat = data.frame(pscore = lalonde$pscore[lalonde$treat == 1], 
                              outcome = lalonde$re78[lalonde$treat == 1])
  
  ntreat = nrow(matching_treat)
  matched_outcome = rep(NA, ntreat)
  
  # number of matches
  M = 5
  
  for (i in 1:ntreat) 
  {
    abs_diffs = abs(matching_treat$pscore[i] - matching_control$pscore)
    idmatch = order(abs_diffs)[1:M]
    matched_outcome[i] = mean(matching_control$outcome[idmatch])
  }
  
  mean(matching_treat$outcome - matched_outcome)
## [1] 2246.787

We can see that the estimated ATT is quite different from the estimated ATE. This is because we are now focusing on a different population — the treated subjects — rather than all the samples in the data.

Ding, Peng. 2023. “A First Course in Causal Inference.” arXiv Preprint arXiv:2305.18793.
Horvitz, Daniel G, and Donovan J Thompson. 1952. “A Generalization of Sampling Without Replacement from a Finite Universe.” Journal of the American Statistical Association 47 (260): 663–85.
LaLonde, Robert J. 1986. “Evaluating the Econometric Evaluations of Training Programs with Experimental Data.” The American Economic Review 76 (4): 604–20.
Rubin, Donald B. 1978. “Bayesian Inference for Causal Effects: The Role of Randomization.” The Annals of Statistics, 34–58.

  1. An interesting result is that \(\widehat\tau_{\text{ipw}}\) with the estimated propensity score will actually perform better (in terms of asymptotic variance) than \(\widehat\tau_{\text{ipw}}^\ast\), which directly uses the true propensity score. Intuitively, the estimated propensity score accounts for the randomness in the number of subjects that actually received the treatment/control.↩︎

  2. This view is closely related to the Radon-Nikodym theorem, which states that \[ \mathbb{E}_P[g(X)] = \mathbb{E}_Q\left[g(X) \frac{dP}{dQ}(X)\right] \] where \(P\) and \(Q\) are two probability measures, and \(\frac{dP}{dQ}(X)\) is the Radon-Nikodym derivative (or density ratio) of \(P\) with respect to \(Q\). Here, we can view the overall distribution of \(X\) as \(P\), and the distribution of \(X\) in the treated group as \(Q\). Then the density ratio is exactly \(\Pr(A = 1) / e(X)\). This idea will also be used later in the outcome weighted learning method for personalized medicine.↩︎

  3. Another alternative we can consider is the covariate matching, which matches the treated and control subjects based on their covariates \(X_i\) directly. The idea is similar, and we leave it as an exercise for the readers.↩︎