We have now introduced the inverse probability weighting (IPW) estimator as
\[ \hat{\tau}_{\text{ipw}} = \frac{1}{n} \sum_{i=1}^n \left( \frac{Y_i A_i}{\hat{e}(X_i)} - \frac{Y_i (1 - A_i)}{1 - \hat{e}(X_i)} \right) \tag{1} \]
On the other hand, if we can construct good regression estimators of the conditional mean of the outcome \(Y\), say \(\hat{\mu}_{(1)}(X)\) for those who received treatment and \(\hat{\mu}_{(0)}(X)\) for those who did not, then we can form the regression estimator of the ATE as
\[ \hat{\tau}_{\text{reg}} = \frac{1}{n} \sum_{i=1}^n \left( \hat{\mu}_{(1)}(X_i) - \hat{\mu}_{(0)}(X_i) \right) \tag{2} \]
An interesting way of combining these two estimators is the augmented inverse probability weighting (AIPW) estimator (Robins, Rotnitzky, and Zhao 1994), which is defined as
\[ \hat{\tau}_{\text{aipw}} = \frac{1}{n} \sum_{i=1}^n \left( \hat{\mu}_{(1)}(X_i) - \hat{\mu}_{(0)}(X_i) + \frac{A_i}{\hat{e}(X_i)} \left(Y_i - \hat{\mu}_{(1)}(X_i) \right) - \frac{(1 - A_i)}{1 - \hat{e}(X_i)} \left(Y_i - \hat{\mu}_{(0)}(X_i) \right) \right) \tag{3} \]
Looking at its form, this estimator is attempting to correct the regression estimator. For a treated observation, \(Y_i - \hat{\mu}_{(1)}(X_i)\) is the error of the regression estimator, and adding an inverse-propensity-weighted version of this quantity adjusts the regression estimator; the same holds for the control observations. Interestingly, this estimator enjoys the doubly robust property: if either the propensity score model or the outcome regression model is correctly specified (so that the corresponding estimator is consistent), then the AIPW estimator is consistent.
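As a quick illustration of the formula, suppose we already have a treatment vector a, an outcome vector y, the fitted propensity scores pscore, and the fitted outcome models mu1hat and mu0hat evaluated at each \(X_i\) (these object names are just placeholders); then the AIPW estimate in (3) is a one-liner in R:
tau_aipw <- mean(mu1hat - mu0hat +
                 a * (y - mu1hat) / pscore -
                 (1 - a) * (y - mu0hat) / (1 - pscore))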
To understand the double robustness property, let's consider the expectation version of the estimator, which the sample average in (3) approximates well once the sample size is large:
\[ \mathbb{E} \left[ \hat{\mu}_{(1)}(X) + \frac{A}{\hat{e}(X)} \left(Y - \hat{\mu}_{(1)}(X) \right) \right] - \mathbb{E} \left[ \hat{\mu}_{(0)}(X) + \frac{(1 - A)}{1 - \hat{e}(X)} \left(Y - \hat{\mu}_{(0)}(X) \right) \right] \tag{4} \]
The first term should target \(\E [ Y(1) ]\). Let's rearrange it and examine the difference between the two:
\[ \begin{aligned} &\E \left[ \hat{\mu}_{(1)}(X) + \frac{A}{\hat{e}(X)} \left(Y - \hat{\mu}_{(1)}(X) \right) - Y(1) \right] \\ =& \E \left[ \frac{A}{\hat{e}(X)} \left(Y - \hat{\mu}_{(1)}(X) \right) - \left( Y(1) - \hat{\mu}_{(1)}(X) \right) \right] \\ =& \E \left[ \frac{A - \hat{e}(X)}{\hat{e}(X)} \left( Y(1) - \hat{\mu}_{(1)}(X) \right) \right] \quad \text{by SUTVA} \\ =& \E \left[ \E \left[ \frac{A - \hat{e}(X)}{\hat{e}(X)} \left( Y(1) - \hat{\mu}_{(1)}(X) \right) \Biggm| X \right] \right] \\ =& \E \left[ \E \left[ \frac{A - \hat{e}(X)}{\hat{e}(X)} \Biggm| X \right] \times \E \left[ Y(1) - \hat{\mu}_{(1)}(X) \Biggm| X \right] \right] \quad \text{by Unconfoundedness}\\ =& \E \left[ \frac{e(X) - \hat{e}(X)}{\hat{e}(X)} \times \left( \mu_{(1)}(X) - \hat{\mu}_{(1)}(X) \right) \right] \end{aligned} \]
Here \(\mu_{(1)}(X) = \E[ Y(1) \mid X ]\) denotes the true conditional mean of \(Y(1)\). It is then easy to see that whenever \(\hat{\mu}_{(1)}(X)\) equals \(\mu_{(1)}(X)\) or \(\hat{e}(X)\) equals \(e(X)\), the expectation is zero, meaning that the first term is unbiased for \(\E[Y(1)]\). The derivation for the second term is almost identical. It may also be interesting to establish an upper bound on the bias. Applying the Cauchy-Schwarz inequality
\[ |\E(U V)| \leq \sqrt{\E(U^2) \E(V^2)}, \]
we obtain
\[ \begin{aligned} & \biggm| \mathbb{E} \left[ \frac{e(X) - \hat{e}(X)}{\hat{e}(X)} \times \left[ \mu_{(1)}(X) - \hat{\mu}_{(1)}(X) \right] \right] \biggm| \\ \leq & \sqrt{\mathbb{E} \left[ \left( \frac{e(X) - \hat{e}(X)}{\hat{e}(X)} \right)^2 \right] \times \mathbb{E} \left[ \left( \mu_{(1)}(X) - \hat{\mu}_{(1)}(X) \right)^2 \right]} \end{aligned} \]
This shows that the bias of the estimator is upper bounded by the product of the (root-mean-square) errors of \(\hat{e}(X)\) and \(\hat{\mu}_{(1)}(X)\). Hence, as long as one of the two nuisance models is estimated correctly, the AIPW estimator has zero bias.
An alternative view of this bias correction comes from re-organizing the terms in (4):
\[ \mathbb{E} \left[ \hat{\mu}_{(1)}(X) - \hat{\mu}_{(0)}(X) \right] + \mathbb{E} \left[ \frac{A}{\hat{e}(X)} \left(Y - \hat{\mu}_{(1)}(X) \right) - \frac{(1 - A)}{1 - \hat{e}(X)} \left(Y - \hat{\mu}_{(0)}(X) \right) \right] \]
Here the first term \(\mathbb{E} \left[ \hat{\mu}_{(1)}(X) - \hat{\mu}_{(0)}(X) \right]\) is the (population version of the) regression estimator of the ATE, while the second term corrects its bias. Each residual \(Y - \hat{\mu}_{(1)}(X)\) or \(Y - \hat{\mu}_{(0)}(X)\) reflects the error of the corresponding outcome model at that \(X\), and it is weighted by the inverse of the estimated propensity score.
The variance of the AIPW estimator is a bit more complicated. Let's recall this re-organized form of (4) before the analysis:
\[ \mathbb{E} \left[ \hat{\mu}_{(1)}(X) - \hat{\mu}_{(0)}(X) \right] + \mathbb{E} \left[ \frac{A}{\hat{e}(X)} \left(Y - \hat{\mu}_{(1)}(X) \right) - \frac{(1 - A)}{1 - \hat{e}(X)} \left(Y - \hat{\mu}_{(0)}(X) \right) \right] \]
We would typically expect the first term to have very small variance (even if it is biased), because it is simply an average of fitted conditional mean functions. Hence the variance mainly comes from the second term. Let's analyze the part associated with \(A = 1\):
\[ \begin{aligned} & \Var \left[ \frac{A}{\hat{e}(X)} \left( Y - \hat{\mu}_{(1)}(X) \right) \right] \\ =& \Var \left[ \frac{A}{\hat{e}(X)} \left( Y(1) - \hat{\mu}_{(1)}(X) \right) \right] \quad \text{by SUTVA} \\ \leq & \E \left[ \frac{A}{\hat{e}^2(X)} \left( Y(1) - \hat{\mu}_{(1)}(X) \right)^2 \right] \quad \text{by}\,\, \Var(Z) \leq \E(Z^2) \,\,\text{and}\,\, A^2 = A\\ = & \E \left[ \frac{e(X)}{\hat{e}^2(X)} \times \E \left[ \left( Y(1) - \hat{\mu}_{(1)}(X) \right)^2 \biggm| X \right] \right] \quad \text{by Unconfoundedness} \end{aligned} \]
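For comparison, applying the same steps (SUTVA, the variance bound, and unconfoundedness) to the corresponding IPW term, which does not subtract an outcome model, gives
\[ \Var \left[ \frac{A Y}{\hat{e}(X)} \right] \leq \E \left[ \frac{e(X)}{\hat{e}^2(X)} \times \E \left[ Y(1)^2 \biggm| X \right] \right] \]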
The conditional second moment \(\E \left[ \left( Y(1) - \hat{\mu}_{(1)}(X) \right)^2 \bigm| X \right]\) is usually much smaller than \(\E \left[ Y(1)^2 \bigm| X \right]\) when \(\hat{\mu}_{(1)}(X)\) is a reasonable estimate of the conditional mean. Although the exact comparison depends on many factors, the variance of the AIPW estimator is typically smaller than that of the IPW estimator.
Let’s compare several estimators we have covered so far:
The following example is modified from Ding (2023).
ATE_Est <- function(a, y, x,
truncps = c(0, 1)) # truncate propensity score if needed (positivity assumption)
{
ate_data = data.frame("a" = a, "y" = y, x)
# fitted propensity score
pscore <- glm(a ~ . - y, data = ate_data, family = "binomial")$fitted.values
pscore <- pmax(truncps[1], pmin(truncps[2], pscore))
# fitted potential outcomes
# zero/one weights restrict the fit to the treated (or control) subjects, but fitted values are still produced for all subjects
outcome1 <- lm(y ~ . - a, data = ate_data, weights = a)$fitted.values
outcome0 <- lm(y ~ . - a, data = ate_data, weights = (1 - a))$fitted.values
# outcome regression estimator
ate_reg <- mean(outcome1 - outcome0)
# IPW estimators
y_treat <- mean(a * y / pscore)
y_control <- mean((1 - a) * y / (1 - pscore))
ate_ipw <- y_treat - y_control
# Hajek estimator (see HW1)
one_treat <- mean(a / pscore)
one_control <- mean((1 - a) / (1 - pscore))
ate_hajek <- y_treat / one_treat - y_control / one_control
# doubly robust estimator
res1 <- y - outcome1
res0 <- y - outcome0
r_treat <- mean(a * res1 / pscore) # residual correction term for treated
r_control <- mean((1 - a) * res0 / (1 - pscore)) # residual correction term for control
ate_dr <- ate_reg + r_treat - r_control
return(c(ate_reg, ate_ipw, ate_hajek, ate_dr))
}
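As a quick sanity check (a small sketch, separate from the simulation study below; x_toy, a_toy, and y_toy are just illustrative names), the function can be called on a toy dataset where the treatment effect is a constant 1, so the true ATE is 1:
x_toy <- matrix(rnorm(200 * 2), 200, 2)
a_toy <- rbinom(200, 1, plogis(x_toy[, 1]))
y_toy <- rnorm(200, 1 + x_toy[, 1] + a_toy)  # constant treatment effect of 1
ATE_Est(a_toy, y_toy, x_toy)  # returns c(reg, IPW, Hajek, DR)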
We will consider four different cases to compare these estimators:
n = 500
nsim = 1000
set.seed(1)
data_gen <- function(n,
ps = TRUE, # TRUE/FALSE for correct propensity score
reg = TRUE) # TRUE/FALSE for correct regression model
{
# two-dimensional covariates
x <- matrix(rnorm(n * 2), n, 2)
if (ps) # TRUE/FALSE for correct propensity score
{
# generate propensity score and treatment label
beta_ps <- c(1, 1)
pscore <- 1 / (1 + exp(-as.vector(x %*% beta_ps)))
a <- rbinom(n, 1, pscore)
}else{
# nonlinear propensity score
x1 = cbind(x, exp(x))
beta_ps <- c(0, 0, 1, -1)
pscore <- 1 / (1 + exp(1 - as.vector(x1 %*% beta_ps)))
a <- rbinom(n, 1, pscore)
}
if (reg) # TRUE/FALSE for correct regression model
{
# generate potential outcomes
beta_y1 <- c(2, 1)
beta_y0 <- c(2, 1)
y1 <- rnorm(n, 1 + x %*% beta_y1)
y0 <- rnorm(n, 1 + x %*% beta_y0)
}else{
# generate potential outcomes
x1 = cbind(x, exp(x))
beta_y1 <- c(1, -1, 0.5, -0.2)
beta_y0 <- c(-1, 1, -0.5, 0.2)
y1 <- rnorm(n, 1 + x1 %*% beta_y1)
y0 <- rnorm(n, 1 + x1 %*% beta_y0)
}
# observed outcome
y <- a * y1 + (1 - a) * y0
return(list(a = a, y = y, x = x,
tau = mean(y1 - y0))) # tau is the true ATE for this dataset
}
In this first case, the data-generating model matches both of our working models exactly, so the doubly robust estimator should perform very well. Note also that since the outcome regression model is correctly specified and essentially optimal (in the root-\(n\) sense), there is not much room to improve upon the regression estimator.
result_pool = matrix(NA, nsim, 4)
for (k in 1:nsim)
{
mydata = data_gen(n, ps = TRUE, reg = TRUE)
# estimated values
result_pool[k, ] = ATE_Est(mydata$a, mydata$y, mydata$x) - mydata$tau
}
# summarize results from all simulations
summary_mat = rbind(apply(result_pool, 2, mean),
apply(result_pool, 2, sd))
rownames(summary_mat) <- c("estimate", "se")
colnames(summary_mat) <- c("reg", "ipw", "Hajek", "DR")
round(summary_mat, 5)
## reg ipw Hajek DR
## estimate 0.00150 -0.00357 0.01170 0.00563
## se 0.08714 0.35163 0.29993 0.10571
For this case, the data generator uses a nonlinear propensity score, so our working propensity score model is misspecified. The IPW and Hajek estimators can be severely biased here, while the regression and doubly robust estimators remain essentially unbiased.
result_pool = matrix(NA, nsim, 4)
for (k in 1:nsim)
{
mydata = data_gen(n, ps = FALSE, reg = TRUE)
# estimated values
result_pool[k, ] = ATE_Est(mydata$a, mydata$y, mydata$x) - mydata$tau
}
# summarize results from all simulations
summary_mat = rbind(apply(result_pool, 2, mean),
apply(result_pool, 2, sd))
rownames(summary_mat) <- c("estimate", "se")
colnames(summary_mat) <- c("reg", "ipw", "Hajek", "DR")
round(summary_mat, 5)
## reg ipw Hajek DR
## estimate -0.00470 -0.68136 -0.64346 -0.01031
## se 0.10487 0.71156 0.50183 0.18670
In this case, the outcome model in the data generator contains nonlinear terms, so the working regression model is misspecified and the regression estimator is biased. The IPW and Hajek estimators are unaffected since they do not use the outcome model, and the doubly robust estimator remains essentially unbiased because the propensity score model is still correct, although its variance is somewhat larger than in the first case.
result_pool = matrix(NA, nsim, 4)
for (k in 1:nsim)
{
mydata = data_gen(n, ps = TRUE, reg = FALSE)
# estimated values
result_pool[k, ] = ATE_Est(mydata$a, mydata$y, mydata$x) - mydata$tau
}
# summarize results from all simulations
summary_mat = rbind(apply(result_pool, 2, mean),
apply(result_pool, 2, sd))
rownames(summary_mat) <- c("estimate", "se")
colnames(summary_mat) <- c("reg", "ipw", "Hajek", "DR")
round(summary_mat, 5)
## reg ipw Hajek DR
## estimate -0.13398 -0.00158 -0.00435 -0.00228
## se 0.11941 0.31770 0.29824 0.20697
In this final case, both working models are misspecified and all estimators perform poorly. Hence there is a price to pay for relying on double robustness: when both models are wrong, the estimator can in fact be doubly fragile.
result_pool = matrix(NA, nsim, 4)
for (k in 1:nsim)
{
mydata = data_gen(n, ps = FALSE, reg = FALSE)
# estimated values
result_pool[k, ] = ATE_Est(mydata$a, mydata$y, mydata$x) - mydata$tau
}
# summarize results from all simulations
summary_mat = rbind(apply(result_pool, 2, mean),
apply(result_pool, 2, sd))
rownames(summary_mat) <- c("estimate", "se")
colnames(summary_mat) <- c("reg", "ipw", "Hajek", "DR")
round(summary_mat, 5)
## reg ipw Hajek DR
## estimate -0.20039 -0.26808 -0.38877 0.32407
## se 0.16309 0.48331 0.46917 0.51623
There is actually a little catch in our previous discussion of the doubly robust estimator: its nice theoretical properties rely on the propensity score and the outcome model being estimated on data separate from the data used to form the estimator. In other words, it is ideal if the errors of the propensity score and the outcome models are independent of the observations used for estimating the ATE. To make this concrete, in equation (3) we would want the plug-in estimators \(\hat{e}(X)\) and \(\hat{\mu}(X)\) to be obtained from samples independent of \(\{X_i, A_i, Y_i\}_{i=1}^n\). A practical way to achieve this is to split the original sample into parts, fit the propensity score and the outcome models on one part, use the held-out part to perform the AIPW estimation, and then swap the roles and average the results; this procedure is known as cross-fitting. This requirement is a consequence of semiparametric efficiency theory, an advanced statistical topic (Bickel et al. 1993); more details can be found in the textbook of Kosorok (2008).
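Concretely, if we split the sample into \(K\) folds \(\mathcal{I}_1, \ldots, \mathcal{I}_K\) and write \(\hat{e}^{(-k)}\) and \(\hat{\mu}_{(a)}^{(-k)}\) for the nuisance models fitted without fold \(k\), the cross-fitted AIPW estimator takes the form (a sketch of the standard construction)
\[ \hat{\tau}_{\text{cf}} = \frac{1}{n} \sum_{k=1}^{K} \sum_{i \in \mathcal{I}_k} \left( \hat{\mu}_{(1)}^{(-k)}(X_i) - \hat{\mu}_{(0)}^{(-k)}(X_i) + \frac{A_i}{\hat{e}^{(-k)}(X_i)} \left( Y_i - \hat{\mu}_{(1)}^{(-k)}(X_i) \right) - \frac{1 - A_i}{1 - \hat{e}^{(-k)}(X_i)} \left( Y_i - \hat{\mu}_{(0)}^{(-k)}(X_i) \right) \right) \]
Averaging the \(K\) fold-level estimates, as in the code below, gives the same quantity when the folds have equal size.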
The following updated code implements this idea. However, in this example it does not really improve the performance (standard error) much, mainly because \(\hat{e}(X)\) and \(\hat{\mu}(X)\) are simple parametric models and already converge at the root-\(n\) rate. When more flexible machine learning models are used (Chernozhukov et al. 2018), the effect of cross-fitting is more visible. To highlight the results from semiparametric efficiency theory: we only require \(\hat{e}(X)\) and \(\hat{\mu}(X)\) to be \(n^{1/4}\)-consistent for the doubly robust estimator to be root-\(n\) consistent, which is a much less stringent requirement than the root-\(n\) rate typically needed for each model on its own. 1
dr_cross <- function(a, y, x, idx)
{
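# idx: indices of the fold used for estimation; the remaining observations are used to fit the plug-in (nuisance) models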
alldata = data.frame(a, y, x)
data_plugin = alldata[-idx, ]
data_est = alldata[idx, ]
# fitted propensity score
ps_model <- glm(a ~ . - y, data = data_plugin, family = "binomial")
pscore_est <- predict(ps_model, newdata = data_est, type = "response")
# fitted potential outcomes
outcome_model1 <- glm(y ~ . - a, data = data_plugin, weights = a, family = "gaussian")
outcome_est1 <- predict(outcome_model1, newdata = data_est)
outcome_model0 <- glm(y ~ . - a, data = data_plugin, weights = (1 - a), family = "gaussian")
outcome_est0 <- predict(outcome_model0, newdata = data_est)
# doubly robust estimator
res1_est <- data_est$y - outcome_est1
res0_est <- data_est$y - outcome_est0
r_treat_est <- mean(data_est$a * res1_est / pscore_est) # residual correction term for treated
r_control_est <- mean((1 - data_est$a) * res0_est / (1 - pscore_est)) # residual correction term for control
dr_crossfit <- mean(outcome_est1 - outcome_est0) + r_treat_est - r_control_est
# estimated values
return(dr_crossfit)
}
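For instance, a simple two-fold version fits the nuisance models on one half of the sample, estimates on the other half, and then swaps the roles (a quick sketch; tmp and half are just illustrative names, and the simulation below uses a 5-fold version instead):
tmp <- data_gen(n, ps = TRUE, reg = TRUE)
half <- 1:(n / 2)
mean(c(dr_cross(tmp$a, tmp$y, tmp$x, half),
       dr_cross(tmp$a, tmp$y, tmp$x, setdiff(1:n, half))))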
result_drcross = rep(NA, nsim)
for (k in 1:nsim)
{
mydata = data_gen(n, ps = TRUE, reg = TRUE)
# fit the cross-fitted doubly robust estimator
M = 5 # use a 5 fold version
allidx = sample(1:n, n)
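# split the permuted indices into M consecutive chunks, which serve as the folds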
index <- seq_along(allidx)
factor_levels <- cut(index, breaks=M, labels=FALSE)
chunks <- split(allidx, factor_levels)
cross_est_fold = rep(NA, M)
for (j in 1:M)
{
cross_est_fold[j] = dr_cross(mydata$a, mydata$y, mydata$x, chunks[[j]])
}
# estimated values
result_drcross[k] = mean(cross_est_fold) - mydata$tau
}
# summarize results from all simulations
mean(result_drcross)
## [1] 0.01099088
sd(result_drcross)
## [1] 0.117282
In practice, it is often very difficult to derive the variance of an estimator theoretically, since it may be affected by many unknown components. One way to estimate the variance is the bootstrap. It is well known that the bootstrap can be used to approximate the sampling distribution. Mathematically, if we denote the observed data as \(X_1, X_2, \ldots, X_n\), drawn independently from an underlying distribution \(\mathbf{P}\), then we define the empirical distribution of this sample as \(\mathbf{P}_n = n^{-1} \sum_{i = 1}^n \delta_{X_i}\), where \(\delta_{X_i}\) denotes a point mass at \(X_i\). The following is an example of the empirical estimation of the c.d.f. of a univariate variable.
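In particular, the empirical c.d.f. evaluated at a point \(u\) is simply the proportion of observations no larger than \(u\), which is exactly what the code below computes on a grid of \(u\) values:
\[ \hat{F}_n(u) = \mathbf{P}_n \left( (-\infty, u] \right) = \frac{1}{n} \sum_{i=1}^n \mathbf{1}\{ X_i \leq u \} \]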
# obtain some samples from normal
n = 50
x = rnorm(n)
# calculate the empirical cdf over a grid of points
u = seq(-3.5, 3.5, length.out = 200)
mycdf <- function(u, X)
{
cdfe = rep(0, length(u))
for (i in 1:length(u))
cdfe[i] = mean(X <= u[i])
return(cdfe)
}
Fn = mycdf(u, x)
# plot the estimated cdf
plot(u, Fn, type = "l", xlab = "x", ylab = "F(x)")
# the sample mean (x bar) and its 95% normal-approximation confidence interval
mean(x)
## [1] 0.0493282
c(mean(x) - 1.96*sd(x)/sqrt(n), mean(x) + 1.96*sd(x)/sqrt(n))
## [1] -0.2209095 0.3195659
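As a side note, base R provides the same step function through ecdf(); a call such as the following should reproduce essentially the same curve:
plot(ecdf(x), xlab = "x", ylab = "F(x)")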
Since we can expect the empirical distribution \(\mathbf{P}_n\) to converge to the true distribution \(\mathbf{P}\) as \(n\) grows, we can also anticipate that independent draws with replacement from \(\mathbf{P}_n\) (bootstrap samples) behave approximately like draws from \(\mathbf{P}\). An important consequence is that a statistic computed on bootstrap samples approximates the sampling distribution of that statistic under \(\mathbf{P}\), so its variability across bootstrap samples can be used to estimate standard errors and confidence intervals.
Here is an example for estimating the variance of \(\bar{x}\), but using bootstrap samples:
# plot the estimated cdf
plot(u, Fn, type = "l", xlab = "x", ylab = "F(x)", lwd = 2, col = "red")
# bootstrapped means
nbs = 200
xbarb = rep(0, nbs)
for (i in 1:nbs)
{
xb = sample(x, replace = TRUE)
points(u, mycdf(u, xb), type = "s", col = "gray")
xbarb[i] = mean(xb)
}
# bootstrap estimation of 90% confidence interval of xbar
quantile(xbarb, c(0.05, 0.95))
## 5% 95%
## -0.1551600 0.3401761
Using this idea, we can easily obtain a bootstrap estimate of the variance for any of the ATE estimators. The following code applies it to all four estimators (regression, IPW, Hajek, and doubly robust) at once.
n = 500
nbs = 500
mydata = data_gen(n, ps = TRUE, reg = TRUE)
point.est = ATE_Est(mydata$a, mydata$y, mydata$x)
boot.est = replicate(nbs, {
id.boot = sample(1:n, n, replace = TRUE )
ATE_Est(mydata$a[id.boot], mydata$y[id.boot], mydata$x[id.boot, ]) - mydata$tau
})
rbind(point.est,
"sd" = apply(boot.est, 1, sd))
## [,1] [,2] [,3] [,4]
## point.est 0.08127117 0.1065476 0.1349242 0.06750291
## sd 0.10209857 0.2266657 0.2092851 0.11934551
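Based on these bootstrap replicates, we could also report a confidence interval; for example, a normal-approximation 95% interval for the doubly robust estimate (the fourth column above corresponds to the DR estimator) could be computed as
point.est[4] + c(-1.96, 1.96) * sd(boot.est[4, ])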
In fact, a convenient way of implementing cross-fitting is to use the out-of-bag mechanism in random forests. Since a random forest grows each tree on a randomly sampled subset of the data, the prediction for a subject can be formed using only the trees that did not involve that subject. This is called the out-of-bag prediction, and it is (approximately) independent of the subject's own observation.↩︎