\(\newcommand{\ci}{\perp\!\!\!\perp}\) \(\newcommand{\cA}{\mathcal{A}}\) \(\newcommand{\cB}{\mathcal{B}}\) \(\newcommand{\cC}{\mathcal{C}}\) \(\newcommand{\cD}{\mathcal{D}}\) \(\newcommand{\cE}{\mathcal{E}}\) \(\newcommand{\cF}{\mathcal{F}}\) \(\newcommand{\cG}{\mathcal{G}}\) \(\newcommand{\cH}{\mathcal{H}}\) \(\newcommand{\cI}{\mathcal{I}}\) \(\newcommand{\cJ}{\mathcal{J}}\) \(\newcommand{\cK}{\mathcal{K}}\) \(\newcommand{\cL}{\mathcal{L}}\) \(\newcommand{\cM}{\mathcal{M}}\) \(\newcommand{\cN}{\mathcal{N}}\) \(\newcommand{\cO}{\mathcal{O}}\) \(\newcommand{\cP}{\mathcal{P}}\) \(\newcommand{\cQ}{\mathcal{Q}}\) \(\newcommand{\cR}{\mathcal{R}}\) \(\newcommand{\cS}{\mathcal{S}}\) \(\newcommand{\cT}{\mathcal{T}}\) \(\newcommand{\cU}{\mathcal{U}}\) \(\newcommand{\cV}{\mathcal{V}}\) \(\newcommand{\cW}{\mathcal{W}}\) \(\newcommand{\cX}{\mathcal{X}}\) \(\newcommand{\cY}{\mathcal{Y}}\) \(\newcommand{\cZ}{\mathcal{Z}}\) \(\newcommand{\bA}{\mathbf{A}}\) \(\newcommand{\bB}{\mathbf{B}}\) \(\newcommand{\bC}{\mathbf{C}}\) \(\newcommand{\bD}{\mathbf{D}}\) \(\newcommand{\bE}{\mathbf{E}}\) \(\newcommand{\bF}{\mathbf{F}}\) \(\newcommand{\bG}{\mathbf{G}}\) \(\newcommand{\bH}{\mathbf{H}}\) \(\newcommand{\bI}{\mathbf{I}}\) \(\newcommand{\bJ}{\mathbf{J}}\) \(\newcommand{\bK}{\mathbf{K}}\) \(\newcommand{\bL}{\mathbf{L}}\) \(\newcommand{\bM}{\mathbf{M}}\) \(\newcommand{\bN}{\mathbf{N}}\) \(\newcommand{\bO}{\mathbf{O}}\) \(\newcommand{\bP}{\mathbf{P}}\) \(\newcommand{\bQ}{\mathbf{Q}}\) \(\newcommand{\bR}{\mathbf{R}}\) \(\newcommand{\bS}{\mathbf{S}}\) \(\newcommand{\bT}{\mathbf{T}}\) \(\newcommand{\bU}{\mathbf{U}}\) \(\newcommand{\bV}{\mathbf{V}}\) \(\newcommand{\bW}{\mathbf{W}}\) \(\newcommand{\bX}{\mathbf{X}}\) \(\newcommand{\bY}{\mathbf{Y}}\) \(\newcommand{\bZ}{\mathbf{Z}}\) \(\newcommand{\ba}{\mathbf{a}}\) \(\newcommand{\bb}{\mathbf{b}}\) \(\newcommand{\bc}{\mathbf{c}}\) \(\newcommand{\bd}{\mathbf{d}}\) \(\newcommand{\be}{\mathbf{e}}\) \(\newcommand{\bg}{\mathbf{g}}\) 
\(\newcommand{\bh}{\mathbf{h}}\) \(\newcommand{\bi}{\mathbf{i}}\) \(\newcommand{\bj}{\mathbf{j}}\) \(\newcommand{\bk}{\mathbf{k}}\) \(\newcommand{\bl}{\mathbf{l}}\) \(\newcommand{\bm}{\mathbf{m}}\) \(\newcommand{\bn}{\mathbf{n}}\) \(\newcommand{\bo}{\mathbf{o}}\) \(\newcommand{\bp}{\mathbf{p}}\) \(\newcommand{\bq}{\mathbf{q}}\) \(\newcommand{\br}{\mathbf{r}}\) \(\newcommand{\bs}{\mathbf{s}}\) \(\newcommand{\bt}{\mathbf{t}}\) \(\newcommand{\bu}{\mathbf{u}}\) \(\newcommand{\bv}{\mathbf{v}}\) \(\newcommand{\bw}{\mathbf{w}}\) \(\newcommand{\bx}{\mathbf{x}}\) \(\newcommand{\by}{\mathbf{y}}\) \(\newcommand{\bz}{\mathbf{z}}\) \(\newcommand{\RR}{\mathbb{R}}\) \(\newcommand{\NN}{\mathbb{N}}\) \(\newcommand{\balpha}{\boldsymbol{\alpha}}\) \(\newcommand{\bbeta}{\boldsymbol{\beta}}\) \(\newcommand{\btheta}{\boldsymbol{\theta}}\) \(\newcommand{\hpi}{\widehat{\pi}}\) \(\newcommand{\bpi}{\boldsymbol{\pi}}\) \(\newcommand{\hbpi}{\widehat{\boldsymbol{\pi}}}\) \(\newcommand{\bxi}{\boldsymbol{\xi}}\) \(\newcommand{\bmu}{\boldsymbol{\mu}}\) \(\newcommand{\bepsilon}{\boldsymbol{\epsilon}}\) \(\newcommand{\bzero}{\mathbf{0}}\) \(\newcommand{\T}{\text{T}}\) \(\newcommand{\Trace}{\text{Trace}}\) \(\newcommand{\Cov}{\text{Cov}}\) \(\newcommand{\Var}{\text{Var}}\) \(\newcommand{\E}{\mathbb{E}}\) \(\newcommand{\Pr}{\text{Pr}}\) \(\newcommand{\pr}{\text{pr}}\) \(\newcommand{\pdf}{\text{pdf}}\) \(\newcommand{\P}{\text{P}}\) \(\newcommand{\p}{\text{p}}\) \(\newcommand{\One}{\mathbf{1}}\) \(\newcommand{\argmin}{\operatorname*{arg\,min}}\) \(\newcommand{\argmax}{\operatorname*{arg\,max}}\) \(\newcommand{\dtheta}{\frac{\partial}{\partial\theta} }\) \(\newcommand{\ptheta}{\nabla_\theta}\) \(\newcommand{\alert}[1]{\color{darkorange}{#1}}\) \(\newcommand{\alertr}[1]{\color{red}{#1}}\) \(\newcommand{\alertb}[1]{\color{blue}{#1}}\)

1 Overview

We have now introduced the inverse probability weighting (IPW) estimator as

\[ \hat{\tau}_{\text{ipw}} = \frac{1}{n} \sum_{i=1}^n \left( \frac{Y_i A_i}{\hat{e}(X_i)} - \frac{Y_i (1 - A_i)}{1 - \hat{e}(X_i)} \right) \tag{1} \]

On the other hand, if we can construct good regression estimators of the conditional mean of the outcome \(Y\), say \(\hat{\mu}_{(1)}(X)\) for those who received treatment and \(\hat{\mu}_{(0)}(X)\) for those who did not, then we can use the regression estimator of the ATE:

\[ \hat{\tau}_{\text{reg}} = \frac{1}{n} \sum_{i=1}^n \left( \hat{\mu}_{(1)}(X_i) - \hat{\mu}_{(0)}(X_i) \right) \tag{2} \]

An interesting idea is to combine these two estimators, which leads to the augmented inverse probability weighted (AIPW) estimator (Robins, Rotnitzky, and Zhao 1994), defined as

\[ \hat{\tau}_{\text{aipw}} = \frac{1}{n} \sum_{i=1}^n \left( \hat{\mu}_{(1)}(X_i) - \hat{\mu}_{(0)}(X_i) + \frac{A_i}{\hat{e}(X_i)} \left(Y_i - \hat{\mu}_{(1)}(X_i) \right) - \frac{(1 - A_i)}{1 - \hat{e}(X_i)} \left(Y_i - \hat{\mu}_{(0)}(X_i) \right) \right) \tag{3} \]

Just by inspecting this estimator, we can see that it attempts to correct the regression estimator. For a treated observation, \(Y_i - \hat{\mu}_{(1)}(X_i)\) is the error of the regression estimator, and adding an inverse-probability-weighted version of this residual adjusts the regression estimator; the same holds for control observations. Interestingly, this estimator enjoys a doubly robust property: if either the outcome regression model or the propensity score model is consistent, then the AIPW estimator is consistent.
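To make the correction concrete, here is a minimal simulation sketch of evaluating (3) with the true nuisance functions plugged in. The data-generating process below is hypothetical and chosen only for illustration; the estimate should land near the true ATE of 1.

```r
# Minimal sketch: AIPW formula (3) with oracle nuisance functions
# (a hypothetical toy data-generating process, for illustration only)
set.seed(42)
n   <- 5000
x   <- rnorm(n)
e   <- plogis(x)             # true propensity score e(X)
a   <- rbinom(n, 1, e)
mu1 <- 1 + x                 # true E[Y(1) | X]
mu0 <- x                     # true E[Y(0) | X], so the true ATE is 1
y   <- a * mu1 + (1 - a) * mu0 + rnorm(n)

# regression part plus inverse-probability-weighted residual corrections
aipw <- mean(mu1 - mu0 + a * (y - mu1) / e - (1 - a) * (y - mu0) / (1 - e))
aipw  # should be close to the true ATE of 1
```

In practice, of course, \(e(X)\) and \(\mu_{(a)}(X)\) must be estimated, which is exactly what the code in the example section below does.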

2 Doubly Robust Property

To understand the double robustness property, let’s consider the population (expectation) version of the estimator, which is close to the sample average as long as the sample size is large:

\[ \mathbb{E} \left[ \hat{\mu}_{(1)}(X) + \frac{A}{\hat{e}(X)} \left(Y - \hat{\mu}_{(1)}(X) \right) \right] - \mathbb{E} \left[ \hat{\mu}_{(0)}(X) + \frac{(1 - A)}{1 - \hat{e}(X)} \left(Y - \hat{\mu}_{(0)}(X) \right) \right] \tag{4} \]

The first term should target \(\E [ Y(1) ]\). Let’s rearrange it and examine the difference between the two quantities:

\[ \begin{aligned} &\E \left[ \hat{\mu}_{(1)}(X) + \frac{A}{\hat{e}(X)} \left(Y - \hat{\mu}_{(1)}(X) \right) - Y(1) \right] \\ =& \E \left[ \frac{A}{\hat{e}(X)} \left(Y - \hat{\mu}_{(1)}(X) \right) - \left( Y(1) - \hat{\mu}_{(1)}(X) \right) \right] \\ =& \E \left[ \frac{A - \hat{e}(X)}{\hat{e}(X)} \left( Y(1) - \hat{\mu}_{(1)}(X) \right) \right] \quad \text{by SUTVA} \\ =& \E \left[ \E \left[ \frac{A - \hat{e}(X)}{\hat{e}(X)} \left( Y(1) - \hat{\mu}_{(1)}(X) \right) \Biggm| X \right] \right] \\ =& \E \left[ \E \left[ \frac{A - \hat{e}(X)}{\hat{e}(X)} \Biggm| X \right] \times \E \left[ Y(1) - \hat{\mu}_{(1)}(X) \Biggm| X \right] \right] \quad \text{by Unconfoundedness}\\ =& \E \left[ \frac{e(X) - \hat{e}(X)}{\hat{e}(X)} \times \left[ \mu_{(1)}(X) - \hat{\mu}_{(1)}(X) \right] \right] \end{aligned} \]

Here, we assume that the two plug-in estimators are held fixed (or independent of \(Y\) and \(A\)); this assumption is revisited in the cross-fitting section later. It is then easy to see that whenever either \(\hat{\mu}_{(1)}(X) = \mu_{(1)}(X)\) or \(\hat{e}(X) = e(X)\), the expectation is zero, meaning that the estimator is unbiased. The derivation for the second term is almost identical. It may also be interesting to establish an upper bound on the bias. If we apply the Cauchy-Schwarz inequality,

\[ |\E(U V)| \leq \sqrt{\E(U^2) \E(V^2)} \]

we obtain

\[ \begin{aligned} & \biggm| \mathbb{E} \left[ \frac{e(X) - \hat{e}(X)}{\hat{e}(X)} \times \left[ \mu_{(1)}(X) - \hat{\mu}_{(1)}(X) \right] \right] \biggm| \\ \leq & \sqrt{\mathbb{E} \left[ \left( \frac{e(X) - \hat{e}(X)}{\hat{e}(X)} \right)^2 \right] \times \mathbb{E} \left[ \left( \mu_{(1)}(X) - \hat{\mu}_{(1)}(X) \right)^2 \right]} \end{aligned} \]

This shows that the bias of the AIPW estimator is upper bounded by the product of the (root-mean-square) errors of \(\hat{e}(X)\) and \(\hat{\mu}_{(1)}(X)\). Hence, as long as one of the two models is estimated without error, the AIPW estimator has zero bias.
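We can check this product-form bias numerically with a small Monte Carlo sketch (the misspecified plug-ins below are hypothetical choices): the bias of the AIPW-style mean of \(Y(1)\) is near zero whenever either plug-in is correct, and visibly nonzero only when both are wrong.

```r
# Monte Carlo check of the product-form bias E[ (e - e_hat)/e_hat * (mu1 - mu1_hat) ]
set.seed(1)
n   <- 1e5
x   <- rnorm(n)
e   <- plogis(x)               # true propensity score
a   <- rbinom(n, 1, e)
mu1 <- 1 + x                   # true E[Y(1) | X], so E[Y(1)] = 1
y1  <- mu1 + rnorm(n)

mu1_bad <- 1 + 0.5 * x         # hypothetical wrong outcome model
e_bad   <- plogis(0.5 * x)     # hypothetical wrong propensity model

# AIPW-style mean of Y(1) with given plug-ins (note A * Y = A * Y(1) by SUTVA)
est <- function(mu_hat, e_hat) mean(mu_hat + a * (y1 - mu_hat) / e_hat)

bias <- c(both_right = est(mu1, e),
          mu_wrong   = est(mu1_bad, e),
          e_wrong    = est(mu1, e_bad),
          both_wrong = est(mu1_bad, e_bad)) - 1
round(bias, 3)  # only both_wrong should be visibly away from zero
```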

An alternative view of this bias reduction comes from re-organizing the terms in (4):

\[ \mathbb{E} \left[ \hat{\mu}_{(1)}(X) - \hat{\mu}_{(0)}(X) \right] + \mathbb{E} \left[ \frac{A}{\hat{e}(X)} \left(Y - \hat{\mu}_{(1)}(X) \right) - \frac{(1 - A)}{1 - \hat{e}(X)} \left(Y - \hat{\mu}_{(0)}(X) \right) \right] \]

In this case, the first term \(\mathbb{E} \left[ \hat{\mu}_{(1)}(X) - \hat{\mu}_{(0)}(X) \right]\) is the regression estimator of the ATE, while the second term is a correction for its bias. Each residual \(Y - \hat{\mu}(X)\) represents the error of the outcome model at the corresponding \(X\), weighted by the inverse of the estimated propensity score.

3 Variance of the AIPW Estimator

The variance of the AIPW estimator is a bit more complicated. Let’s again work with the re-organized version of (4):

\[ \mathbb{E} \left[ \hat{\mu}_{(1)}(X) - \hat{\mu}_{(0)}(X) \right] + \mathbb{E} \left[ \frac{A}{\hat{e}(X)} \left(Y - \hat{\mu}_{(1)}(X) \right) - \frac{(1 - A)}{1 - \hat{e}(X)} \left(Y - \hat{\mu}_{(0)}(X) \right) \right] \]

We would typically expect the first term to have very small variance (even if it is biased), because it is an average of fitted conditional means. Hence, the variance mainly comes from the second term. Let’s analyze the part associated with \(A = 1\):

\[ \begin{aligned} & \Var \left[ \frac{A}{\hat{e}(X)} \left( Y - \hat{\mu}_{(1)}(X) \right) \right] \\ =& \Var \left[ \frac{A}{\hat{e}(X)} \left( Y(1) - \hat{\mu}_{(1)}(X) \right) \right] \quad \text{by SUTVA} \\ \leq & \E \left[ \frac{A}{\hat{e}^2(X)} \left( Y(1) - \hat{\mu}_{(1)}(X) \right)^2 \right] \quad \text{by}\,\, \Var(X) = \E(X^2) - \E^2(X)\\ = & \E \left[ \frac{e(X)}{\hat{e}^2(X)} \times \E \left[ \left( Y(1) - \hat{\mu}_{(1)}(X) \right)^2 \biggm| X \right] \right] \quad \text{by Unconfoundedness} \end{aligned} \]

This is usually much smaller than the corresponding quantity without subtracting the conditional mean estimate \(\hat{\mu}_{(1)}(X)\) from \(Y(1)\). Although many other factors can play a role, the variance of the AIPW estimator is typically smaller than that of the IPW estimator.
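As a quick illustration (a hypothetical data-generating process, with the true \(e(X)\) and \(\mu_{(1)}(X)\) plugged in), the per-observation AIPW building block has a noticeably smaller standard deviation than the corresponding IPW term:

```r
# Compare the spread of the IPW and AIPW building blocks for the treated arm
set.seed(7)
n   <- 1e5
x   <- rnorm(n)
e   <- plogis(x)
a   <- rbinom(n, 1, e)
mu1 <- 1 + x                              # true E[Y(1) | X]
y   <- a * (mu1 + rnorm(n)) + (1 - a) * (x + rnorm(n))

ipw_terms  <- a * y / e                   # IPW term: weights the raw outcome
aipw_terms <- mu1 + a * (y - mu1) / e     # AIPW term: weights only the residual
c(ipw = sd(ipw_terms), aipw = sd(aipw_terms))
```

Both sets of terms have the same mean, \(\E[Y(1)]\), so the reduction in spread translates directly into a smaller variance for the averaged estimator.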

4 Example: Comparing Different Estimators

Let’s compare several estimators we have covered so far:

  • The regression based estimator given in (2)
  • The original IPW estimator given in (1)
  • The Hajek estimator (see homework)
  • The doubly robust estimator given in (3)

The following example is modified from Ding (2023).

  ATE_Est <- function(a, y, x, 
                      truncps = c(0, 1)) # truncate propensity score if needed (positivity assumption)
  {
    ate_data = data.frame("a" = a, "y" = y, x)
    
    # fitted propensity score
    pscore <- glm(a ~ . - y, data = ate_data, family = "binomial")$fitted.values
    pscore <- pmax(truncps[1], pmin(truncps[2], pscore))
    
    # fitted potential outcomes
    # zero weights restrict the fit to one treatment group, but fitted values are still produced for all subjects
    outcome1 <- lm(y ~ . - a, data = ate_data, weights = a)$fitted.values
    outcome0 <- lm(y ~ . - a, data = ate_data, weights = (1 - a))$fitted.values
    
    # outcome regression estimator
    ate_reg <- mean(outcome1 - outcome0)
    
    # IPW estimators
    y_treat <- mean(a * y / pscore)
    y_control <- mean((1 - a) * y / (1 - pscore))
    ate_ipw <- y_treat - y_control
    
    # Hajek estimator (see HW1)
    one_treat <- mean(a / pscore)
    one_control <- mean((1 - a) / (1 - pscore))
    ate_hajek <- y_treat / one_treat - y_control / one_control
    
    # doubly robust estimator
    res1 <- y - outcome1
    res0 <- y - outcome0
    r_treat <- mean(a * res1 / pscore) # residual correction term for treated
    r_control <- mean((1 - a) * res0 / (1 - pscore)) # residual correction term for control
    ate_dr <- ate_reg + r_treat - r_control
    
    return(c(ate_reg, ate_ipw, ate_hajek, ate_dr))
  }

We will consider four different cases to compare these estimators:

  • When both the propensity score and the outcome model are correctly specified
  • When the propensity score is incorrectly specified
  • When the outcome model is incorrectly specified
  • When both the propensity score and the outcome model are incorrectly specified
  n = 500
  nsim = 1000
  set.seed(1)
  
  data_gen <- function(n, 
                       ps = TRUE, # TRUE/FALSE for correct propensity score
                       reg = TRUE) # TRUE/FALSE for correct regression model
  {
    # two-dimensional covariates (the intercept is added by the model fits)
    x <- matrix(rnorm(n * 2), n, 2)
    
    if (ps) # TRUE/FALSE for correct propensity score
    {
      # generate propensity score and treatment label
      beta_ps <- c(1, 1)
      pscore <- 1 / (1 + exp(-as.vector(x %*% beta_ps)))
      a <- rbinom(n, 1, pscore)
    }else{
      # nonlinear propensity score
      x1 = cbind(x, exp(x))
      beta_ps <- c(0, 0, 1, -1)
      pscore <- 1 / (1 + exp(1 - as.vector(x1 %*% beta_ps)))
      a <- rbinom(n, 1, pscore)
    }
    
    if (reg) # TRUE/FALSE for correct regression model
    {
      # generate potential outcomes
      beta_y1 <- c(2, 1)
      beta_y0 <- c(2, 1)
      y1 <- rnorm(n, 1 + x %*% beta_y1)
      y0 <- rnorm(n, 1 + x %*% beta_y0)
    }else{
      # generate potential outcomes
      x1 = cbind(x, exp(x))
      beta_y1 <- c(1, -1, 0.5, -0.2)
      beta_y0 <- c(-1, 1, -0.5, 0.2)
      y1 <- rnorm(n, 1 + x1 %*% beta_y1)
      y0 <- rnorm(n, 1 + x1 %*% beta_y0)
    }

    # observed outcome
    y <- a * y1 + (1 - a) * y0    
    
    return(list(a = a, y = y, x = x, 
                tau = mean(y1 - y0))) # tau is the true ATE for this dataset
  }

4.1 Case 1: All Correctly Specified

In this case, the data generator matches our working models exactly. Hence, the doubly robust estimator should perform well. It should also be noted that since the regression model is correctly specified and essentially efficient (in the root-\(n\) sense), there is not much room to improve upon the regression estimator.

  result_pool = matrix(NA, nsim, 4)
  
  for (k in 1:nsim)
  {
    mydata = data_gen(n, ps = TRUE, reg = TRUE)
    
    # estimated values
    result_pool[k, ] = ATE_Est(mydata$a, mydata$y, mydata$x) - mydata$tau
  }
  
  # summarize results from all simulations 
  summary_mat = rbind(apply(result_pool, 2, mean),
                      apply(result_pool, 2, sd))
  
  rownames(summary_mat) <- c("estimate", "se")
  colnames(summary_mat) <- c("reg", "ipw", "Hajek", "DR")
  round(summary_mat, 5)
##              reg      ipw   Hajek      DR
## estimate 0.00150 -0.00357 0.01170 0.00563
## se       0.08714  0.35163 0.29993 0.10571

4.2 Case 2: Incorrectly Specified Propensity Score

For this case, we use a different data generator for the propensity score, so our working propensity model is misspecified. The IPW estimator can be very biased in this case.

  result_pool = matrix(NA, nsim, 4)
  
  for (k in 1:nsim)
  {
    mydata = data_gen(n, ps = FALSE, reg = TRUE)
    
    # estimated values
    result_pool[k, ] = ATE_Est(mydata$a, mydata$y, mydata$x) - mydata$tau
  }
  
  # summarize results from all simulations 
  summary_mat = rbind(apply(result_pool, 2, mean),
                      apply(result_pool, 2, sd))
  
  rownames(summary_mat) <- c("estimate", "se")
  colnames(summary_mat) <- c("reg", "ipw", "Hajek", "DR")
  round(summary_mat, 5)
##               reg      ipw    Hajek       DR
## estimate -0.00470 -0.68136 -0.64346 -0.01031
## se        0.10487  0.71156  0.50183  0.18670

4.3 Case 3: Incorrectly Specified Outcome Model

The outcome model is modified to include nonlinear terms. In this case, the regression estimator is biased, while the IPW and Hajek estimators remain unbiased because they do not rely on the outcome model. The doubly robust estimator is also still unbiased, since the propensity score model remains correctly specified.

  result_pool = matrix(NA, nsim, 4)
  
  for (k in 1:nsim)
  {
    mydata = data_gen(n, ps = TRUE, reg = FALSE)
    
    # estimated values
    result_pool[k, ] = ATE_Est(mydata$a, mydata$y, mydata$x) - mydata$tau
  }
  
  # summarize results from all simulations 
  summary_mat = rbind(apply(result_pool, 2, mean),
                      apply(result_pool, 2, sd))
  
  rownames(summary_mat) <- c("estimate", "se")
  colnames(summary_mat) <- c("reg", "ipw", "Hajek", "DR")
  round(summary_mat, 5)
##               reg      ipw    Hajek       DR
## estimate -0.13398 -0.00158 -0.00435 -0.00228
## se        0.11941  0.31770  0.29824  0.20697

4.4 Case 4: Incorrectly Specified Outcome Model and Propensity Score

In this case, all estimators perform very badly. Hence, there is a price to pay for being doubly robust: the estimator can in fact be doubly fragile.

  result_pool = matrix(NA, nsim, 4)
  
  for (k in 1:nsim)
  {
    mydata = data_gen(n, ps = FALSE, reg = FALSE)
    
    # estimated values
    result_pool[k, ] = ATE_Est(mydata$a, mydata$y, mydata$x) - mydata$tau
  }
  
  # summarize results from all simulations 
  summary_mat = rbind(apply(result_pool, 2, mean),
                      apply(result_pool, 2, sd))
  
  rownames(summary_mat) <- c("estimate", "se")
  colnames(summary_mat) <- c("reg", "ipw", "Hajek", "DR")
  round(summary_mat, 5)
##               reg      ipw    Hajek      DR
## estimate -0.20039 -0.26808 -0.38877 0.32407
## se        0.16309  0.48331  0.46917 0.51623

5 Cross-fitting in Doubly Robust Estimation

There is actually a small catch in our previous discussion of the doubly robust estimator. The estimator is doubly robust if the errors of the propensity score and outcome models are independent of the observations used for estimating the ATE. Since we usually do not have extra samples for estimating these nuisance quantities, we can use cross-fitting in practice. To illustrate the idea, consider splitting the data into two halves. In equation (3), we estimate all the \(\hat{e}(X)\) and \(\hat{\mu}(X)\)’s using the first half of the data, and then use them as plug-in estimators to calculate the terms of the ATE estimator on the second half. We then reverse the roles of the two halves and combine all the terms. In this way, each of the \(n\) terms in the estimator has its \(Y_i\), \(A_i\) and \(X_i\) independent of the plug-in estimators \(\hat{e}(X)\) and \(\hat{\mu}(X)\). This construction is justified by semiparametric efficiency theory, an advanced statistical topic (Bickel et al. 1993); more details can be found in the textbook by Kosorok (2008).

The following updated code implements this idea. For this example, however, cross-fitting does not really improve the performance (standard error), mainly because \(\hat{e}(X)\) and \(\hat{\mu}(X)\) are simple parametric models that already converge at the root-\(n\) rate. When more general machine learning models are used (Chernozhukov et al. 2018), the effect of cross-fitting becomes more visible. To highlight the result from semiparametric efficiency theory: with cross-fitting, we only require \(\hat{e}(X)\) and \(\hat{\mu}(X)\) to be \(n^{1/4}\)-consistent to ensure that the doubly robust estimator is root-\(n\) consistent, which is a much less stringent requirement than what we typically need. 1

  dr_cross <- function(a, y, x, idx)
  {
    alldata = data.frame(a, y, x)
    data_plugin = alldata[-idx, ]
    data_est = alldata[idx, ]

    # fitted propensity score
    ps_model <- glm(a ~ . - y, data = data_plugin, family = "binomial")
    pscore_est <- predict(ps_model, newdata = data_est, type = "response")
    
    # fitted potential outcomes
    outcome_model1 <- glm(y ~ . - a, data = data_plugin, weights = a, family = "gaussian")
    outcome_est1 <- predict(outcome_model1, newdata = data_est)
    outcome_model0 <- glm(y ~ . - a, data = data_plugin, weights = (1 - a), family = "gaussian")
    outcome_est0 <- predict(outcome_model0, newdata = data_est)
    
    # doubly robust estimator
    res1_est <- data_est$y - outcome_est1
    res0_est <- data_est$y - outcome_est0
    r_treat_est <- mean(data_est$a * res1_est / pscore_est) # residual correction term for treated
    r_control_est <- mean((1 - data_est$a) * res0_est / (1 - pscore_est)) # residual correction term for control
    
    dr_crossfit <- mean(outcome_est1 - outcome_est0) + r_treat_est - r_control_est
    
    # estimated values
    return(dr_crossfit)
  }
  result_drcross = rep(NA, nsim)
  
  for (k in 1:nsim)
  {
    mydata = data_gen(n, ps = TRUE, reg = TRUE)
    
    # fit the cross-fitted doubly robust estimator
    M = 5 # use a 5 fold version
    allidx = sample(1:n, n)
    index <- seq_along(allidx)
    factor_levels <- cut(index, breaks=M, labels=FALSE)
    chunks <- split(allidx, factor_levels)

    cross_est_fold = rep(NA, M)
    for (j in 1:M)
    {
      cross_est_fold[j] = dr_cross(mydata$a, mydata$y, mydata$x, chunks[[j]])
    }

    # estimated values
    result_drcross[k] = mean(cross_est_fold) - mydata$tau
  }
  
  # summarize results from all simulations 
  mean(result_drcross)
## [1] 0.01099088
  sd(result_drcross)
## [1] 0.117282

6 Bootstrap Estimation of Variance

Sometimes it can be difficult to derive the variance of an ATE estimator analytically. However, as introduced previously, the bootstrap method can estimate the variance of an estimator by re-sampling the data with replacement and re-computing the estimator on each bootstrap sample. Using this idea, we can easily implement a bootstrap variance estimate for any ATE estimator. The following code applies it to all four estimators above.

  n = 500
  nbs = 500
  
  mydata = data_gen(n, ps = TRUE, reg = TRUE)

  point.est = ATE_Est(mydata$a, mydata$y, mydata$x)
  
  boot.est = replicate(nbs, {
    id.boot = sample(1:n, n, replace = TRUE )
    ATE_Est(mydata$a[id.boot], mydata$y[id.boot], mydata$x[id.boot, ]) - mydata$tau
    })

  rbind(point.est, 
        "sd" = apply(boot.est, 1, sd))
##                   [,1]      [,2]       [,3]       [,4]
## point.est -0.008017671 -1.050069 -0.8889289 -0.3034686
## sd         0.098965547  1.259396  0.7804645  0.2444319

7 Estimating ATE with grf Package

Recall our previous lecture on random forest estimation with influence functions. The grf package treats the doubly robust score as an efficient influence function. Here, we update our previous formula explicitly into the doubly robust version:

\[ \psi(Y_i, A_i, X_i; \mu_0, \mu_1, e) = \left[ \mu_1(X_i) - \mu_0(X_i) \right] + \frac{A_i}{e(X_i)} (Y_i - \mu_1(X_i)) - \frac{1 - A_i}{1 - e(X_i)} (Y_i - \mu_0(X_i)) - \tau(x) \]

And we proceed with the forest weighted moment condition to get the local estimator of \(\tau(x)\):

\[ \sum_{i=1}^n w_i(x) \psi(Y_i, A_i, X_i; \mu_0, \mu_1, e) = 0 \]

with \(\mu_0\), \(\mu_1\) and \(e\) being estimated from the data. The following code illustrates how to use the grf package to estimate the ATE with doubly robust estimation, on the lalonde dataset:

  library(grf)
  library(Matching)
## Loading required package: MASS
## ## 
## ##  Matching (Version 4.10-15, Build Date: 2024-10-14)
## ##  See https://www.jsekhon.com for additional documentation.
## ##  Please cite software as:
## ##   Jasjeet S. Sekhon. 2011. ``Multivariate and Propensity Score Matching
## ##   Software with Automated Balance Optimization: The Matching package for R.''
## ##   Journal of Statistical Software, 42(7): 1-52. 
## ##
  data("lalonde")
  
  # prepare data
  y = lalonde$re78
  a = lalonde$treat
  x = lalonde[, !(names(lalonde) %in% c("re78", "treat"))]
  
  # fit the causal forest with doubly robust estimation
  cf_dr = causal_forest(x, y, a, 
                        Y.hat = NULL, W.hat = NULL, # let grf estimate the nuisance functions
                        num.trees = 2000)
  
  # estimate ATE
  ate_dr_grf = average_treatment_effect(cf_dr, target.sample = "all")
  ate_dr_grf
##  estimate   std.err 
## 1549.7443  669.5565

The estimation here includes all samples and estimates the ATE as if the treatment were applied to all of them. If one instead wants to consider only the treated units (the average treatment effect on the treated), then target.sample = "treated" can be used.

  # estimate ATE for treated only
  ate_dr_grf_treated = average_treatment_effect(cf_dr, target.sample = "treated")
  ate_dr_grf_treated
##  estimate   std.err 
## 1725.7991  687.4033

Bickel, Peter J., Chris A. J. Klaassen, Ya’acov Ritov, and Jon A. Wellner. 1993. Efficient and Adaptive Estimation for Semiparametric Models. Vol. 4. Springer.
Chernozhukov, Victor, Denis Chetverikov, Mert Demirer, Esther Duflo, Christian Hansen, Whitney Newey, and James Robins. 2018. “Double/Debiased Machine Learning for Treatment and Structural Parameters: Double/Debiased Machine Learning.” The Econometrics Journal 21 (1).
Ding, Peng. 2023. “A First Course in Causal Inference.” arXiv Preprint arXiv:2305.18793.
Kosorok, Michael R. 2008. Introduction to Empirical Processes and Semiparametric Inference. Vol. 61. Springer.
Robins, James M, Andrea Rotnitzky, and Lue Ping Zhao. 1994. “Estimation of Regression Coefficients When Some Regressors Are Not Always Observed.” Journal of the American Statistical Association 89 (427): 846–66.

  1. In fact, a convenient way of implementing cross-fitting is to use the out-of-bag mechanism in random forests. Since a random forest grows each tree on a randomly sampled subset of the data, the prediction for a subject can be made using only the trees that did not use that subject. This is called the out-of-bag prediction, which is independent of the subject’s own observation.↩︎