We have now introduced the inverse probability weighting (IPW) estimator as
\[ \hat{\tau}_{\text{ipw}} = \frac{1}{n} \sum_{i=1}^n \left( \frac{Y_i A_i}{\hat{e}(X_i)} - \frac{Y_i (1 - A_i)}{1 - \hat{e}(X_i)} \right) \tag{1} \]
On the other hand, if we can construct good regression estimators of the conditional mean of the outcome \(Y\), say \(\hat{\mu}_{(1)}(X)\) for those who received treatment and \(\hat{\mu}_{(0)}(X)\) for those who did not, then we can form the regression estimator of the ATE as
\[ \hat{\tau}_{\text{reg}} = \frac{1}{n} \sum_{i=1}^n \left( \hat{\mu}_{(1)}(X_i) - \hat{\mu}_{(0)}(X_i) \right) \tag{2} \]
An interesting way of combining these two estimators is the augmented inverse probability weighting (AIPW) estimator (Robins, Rotnitzky, and Zhao 1994), which is defined as
\[ \hat{\tau}_{\text{aipw}} = \frac{1}{n} \sum_{i=1}^n \left( \hat{\mu}_{(1)}(X_i) - \hat{\mu}_{(0)}(X_i) + \frac{A_i}{\hat{e}(X_i)} \left(Y_i - \hat{\mu}_{(1)}(X_i) \right) - \frac{(1 - A_i)}{1 - \hat{e}(X_i)} \left(Y_i - \hat{\mu}_{(0)}(X_i) \right) \right) \tag{3} \]
Looking at its form, this estimator is attempting to correct the regression estimator. For a treated observation, \(Y_i - \hat{\mu}_{(1)}(X_i)\) is the error of the regression estimator, and adding an inverse-propensity-weighted version of this quantity adjusts the regression estimator; the same holds for the control observations. Interestingly, this estimator enjoys the doubly robust property: if either the propensity score model or the outcome regression model is correctly specified (so that the corresponding estimator is consistent), then the AIPW estimator is consistent.
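As a quick illustration of the formula, suppose we already have a treatment vector a, an outcome vector y, the fitted propensity scores pscore, and the fitted outcome models mu1hat and mu0hat evaluated at each \(X_i\) (these object names are just placeholders); then the AIPW estimate in (3) is a one-liner in R:
tau_aipw <- mean(mu1hat - mu0hat +
                 a * (y - mu1hat) / pscore -
                 (1 - a) * (y - mu0hat) / (1 - pscore))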
To understand the double robustness property, let's consider the expectation version of the estimator, which the sample average in (3) approximates well once the sample size is large:
\[ \mathbb{E} \left[ \hat{\mu}_{(1)}(X) + \frac{A}{\hat{e}(X)} \left(Y - \hat{\mu}_{(1)}(X) \right) \right] - \mathbb{E} \left[ \hat{\mu}_{(0)}(X) + \frac{(1 - A)}{1 - \hat{e}(X)} \left(Y - \hat{\mu}_{(0)}(X) \right) \right] \tag{4} \]
The first term should target \(\E [ Y(1) ]\). Let's rearrange it and examine the difference between the two:
\[ \begin{aligned} &\E \left[ \hat{\mu}_{(1)}(X) + \frac{A}{\hat{e}(X)} \left(Y - \hat{\mu}_{(1)}(X) \right) - Y(1) \right] \\ =& \E \left[ \frac{A}{\hat{e}(X)} \left(Y - \hat{\mu}_{(1)}(X) \right) - \left( Y(1) - \hat{\mu}_{(1)}(X) \right) \right] \\ =& \E \left[ \frac{A - \hat{e}(X)}{\hat{e}(X)} \left( Y(1) - \hat{\mu}_{(1)}(X) \right) \right] \quad \text{by SUTVA} \\ =& \E \left[ \E \left[ \frac{A - \hat{e}(X)}{\hat{e}(X)} \left( Y(1) - \hat{\mu}_{(1)}(X) \right) \Biggm| X \right] \right] \\ =& \E \left[ \E \left[ \frac{A - \hat{e}(X)}{\hat{e}(X)} \Biggm| X \right] \times \E \left[ Y(1) - \hat{\mu}_{(1)}(X) \Biggm| X \right] \right] \quad \text{by Unconfoundedness}\\ =& \E \left[ \frac{e(X) - \hat{e}(X)}{\hat{e}(X)} \times \left( \mu_{(1)}(X) - \hat{\mu}_{(1)}(X) \right) \right] \end{aligned} \]
Here \(\mu_{(1)}(X) = \E[ Y(1) \mid X ]\) denotes the true conditional mean of \(Y(1)\). It is then easy to see that whenever \(\hat{\mu}_{(1)}(X)\) equals \(\mu_{(1)}(X)\) or \(\hat{e}(X)\) equals \(e(X)\), the expectation is zero, meaning that the first term is unbiased for \(\E[Y(1)]\). The derivation for the second term is almost identical. It may also be interesting to establish an upper bound on the bias. Applying the Cauchy-Schwarz inequality
\[ |\E(U V)| \leq \sqrt{\E(U^2) \E(V^2)}, \]
we obtain
\[ \begin{aligned} & \biggm| \mathbb{E} \left[ \frac{e(X) - \hat{e}(X)}{\hat{e}(X)} \times \left[ \mu_{(1)}(X) - \hat{\mu}_{(1)}(X) \right] \right] \biggm| \\ \leq & \sqrt{\mathbb{E} \left[ \left( \frac{e(X) - \hat{e}(X)}{\hat{e}(X)} \right)^2 \right] \times \mathbb{E} \left[ \left( \mu_{(1)}(X) - \hat{\mu}_{(1)}(X) \right)^2 \right]} \end{aligned} \]
This shows that the bias of the estimator is upper bounded by the product of the (root-mean-square) errors of \(\hat{e}(X)\) and \(\hat{\mu}_{(1)}(X)\). Hence, as long as one of the two nuisance models is estimated correctly, the AIPW estimator has zero bias.
An alternative view of this bias correction comes from re-organizing the terms in (4):
\[ \mathbb{E} \left[ \hat{\mu}_{(1)}(X) - \hat{\mu}_{(0)}(X) \right] + \mathbb{E} \left[ \frac{A}{\hat{e}(X)} \left(Y - \hat{\mu}_{(1)}(X) \right) - \frac{(1 - A)}{1 - \hat{e}(X)} \left(Y - \hat{\mu}_{(0)}(X) \right) \right] \]
Here the first term \(\mathbb{E} \left[ \hat{\mu}_{(1)}(X) - \hat{\mu}_{(0)}(X) \right]\) is the (population version of the) regression estimator of the ATE, while the second term corrects its bias. Each residual \(Y - \hat{\mu}_{(1)}(X)\) or \(Y - \hat{\mu}_{(0)}(X)\) reflects the error of the corresponding outcome model at that \(X\), and it is weighted by the inverse of the estimated propensity score.
The variance of the AIPW estimator is a bit more complicated. Let's recall this re-organized form of (4) before the analysis:
\[ \mathbb{E} \left[ \hat{\mu}_{(1)}(X) - \hat{\mu}_{(0)}(X) \right] + \mathbb{E} \left[ \frac{A}{\hat{e}(X)} \left(Y - \hat{\mu}_{(1)}(X) \right) - \frac{(1 - A)}{1 - \hat{e}(X)} \left(Y - \hat{\mu}_{(0)}(X) \right) \right] \]
We would typically expect the first term to have very small variance (even if it is biased), because it is simply an average of fitted conditional mean functions. Hence the variance mainly comes from the second term. Let's analyze the part associated with \(A = 1\):
\[ \begin{aligned} & \Var \left[ \frac{A}{\hat{e}(X)} \left( Y - \hat{\mu}_{(1)}(X) \right) \right] \\ =& \Var \left[ \frac{A}{\hat{e}(X)} \left( Y(1) - \hat{\mu}_{(1)}(X) \right) \right] \quad \text{by SUTVA} \\ \leq & \E \left[ \frac{A}{\hat{e}^2(X)} \left( Y(1) - \hat{\mu}_{(1)}(X) \right)^2 \right] \quad \text{by}\,\, \Var(Z) \leq \E(Z^2) \,\,\text{and}\,\, A^2 = A\\ = & \E \left[ \frac{e(X)}{\hat{e}^2(X)} \times \E \left[ \left( Y(1) - \hat{\mu}_{(1)}(X) \right)^2 \biggm| X \right] \right] \quad \text{by Unconfoundedness} \end{aligned} \]
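For comparison, applying the same steps (SUTVA, the variance bound, and unconfoundedness) to the corresponding IPW term, which does not subtract an outcome model, gives
\[ \Var \left[ \frac{A Y}{\hat{e}(X)} \right] \leq \E \left[ \frac{e(X)}{\hat{e}^2(X)} \times \E \left[ Y(1)^2 \biggm| X \right] \right] \]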
The conditional second moment \(\E \left[ \left( Y(1) - \hat{\mu}_{(1)}(X) \right)^2 \bigm| X \right]\) is usually much smaller than \(\E \left[ Y(1)^2 \bigm| X \right]\) when \(\hat{\mu}_{(1)}(X)\) is a reasonable estimate of the conditional mean. Although the exact comparison depends on many factors, the variance of the AIPW estimator is typically smaller than that of the IPW estimator.
Let’s compare several estimators we have covered so far:
The following example is modified from Ding (2023).
ATE_Est <- function(a, y, x,
truncps = c(0, 1)) # truncate propensity score if needed (positivity assumption)
{
ate_data = data.frame("a" = a, "y" = y, x)
# fitted propensity score
pscore <- glm(a ~ . - y, data = ate_data, family = "binomial")$fitted.values
pscore <- pmax(truncps[1], pmin(truncps[2], pscore))
# fitted potential outcomes
# zero/one weights restrict the fit to the treated (or control) subjects, but fitted values are still produced for all subjects
outcome1 <- lm(y ~ . - a, data = ate_data, weights = a)$fitted.values
outcome0 <- lm(y ~ . - a, data = ate_data, weights = (1 - a))$fitted.values
# outcome regression estimator
ate_reg <- mean(outcome1 - outcome0)
# IPW estimators
y_treat <- mean(a * y / pscore)
y_control <- mean((1 - a) * y / (1 - pscore))
ate_ipw <- y_treat - y_control
# Hajek estimator (see HW1)
one_treat <- mean(a / pscore)
one_control <- mean((1 - a) / (1 - pscore))
ate_hajek <- y_treat / one_treat - y_control / one_control
# doubly robust estimator
res1 <- y - outcome1
res0 <- y - outcome0
r_treat <- mean(a * res1 / pscore) # residual correction term for treated
r_control <- mean((1 - a) * res0 / (1 - pscore)) # residual correction term for control
ate_dr <- ate_reg + r_treat - r_control
return(c(ate_reg, ate_ipw, ate_hajek, ate_dr))
}
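As a quick sanity check (a small sketch, separate from the simulation study below; x_toy, a_toy, and y_toy are just illustrative names), the function can be called on a toy dataset where the treatment effect is a constant 1, so the true ATE is 1:
x_toy <- matrix(rnorm(200 * 2), 200, 2)
a_toy <- rbinom(200, 1, plogis(x_toy[, 1]))
y_toy <- rnorm(200, 1 + x_toy[, 1] + a_toy)  # constant treatment effect of 1
ATE_Est(a_toy, y_toy, x_toy)  # returns c(reg, IPW, Hajek, DR)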
We will consider four different cases to compare these estimators:
n = 500
nsim = 1000
set.seed(1)
data_gen <- function(n,
ps = TRUE, # TRUE/FALSE for correct propensity score
reg = TRUE) # TRUE/FALSE for correct regression model
{
# two-dimensional covariates
x <- matrix(rnorm(n * 2), n, 2)
if (ps) # TRUE/FALSE for correct propensity score
{
# generate propensity score and treatment label
beta_ps <- c(1, 1)
pscore <- 1 / (1 + exp(-as.vector(x %*% beta_ps)))
a <- rbinom(n, 1, pscore)
}else{
# nonlinear propensity score
x1 = cbind(x, exp(x))
beta_ps <- c(0, 0, 1, -1)
pscore <- 1 / (1 + exp(1 - as.vector(x1 %*% beta_ps)))
a <- rbinom(n, 1, pscore)
}
if (reg) # TRUE/FALSE for correct regression model
{
# generate potential outcomes
beta_y1 <- c(2, 1)
beta_y0 <- c(2, 1)
y1 <- rnorm(n, 1 + x %*% beta_y1)
y0 <- rnorm(n, 1 + x %*% beta_y0)
}else{
# generate potential outcomes
x1 = cbind(x, exp(x))
beta_y1 <- c(1, -1, 0.5, -0.2)
beta_y0 <- c(-1, 1, -0.5, 0.2)
y1 <- rnorm(n, 1 + x1 %*% beta_y1)
y0 <- rnorm(n, 1 + x1 %*% beta_y0)
}
# observed outcome
y <- a * y1 + (1 - a) * y0
return(list(a = a, y = y, x = x,
tau = mean(y1 - y0))) # tau is the true ATE for this dataset
}
In this first case, the data-generating model matches both of our working models exactly, so the doubly robust estimator should perform very well. Note also that since the outcome regression model is correctly specified and essentially optimal (in the root-\(n\) sense), there is not much room to improve upon the regression estimator.
result_pool = matrix(NA, nsim, 4)
for (k in 1:nsim)
{
mydata = data_gen(n, ps = TRUE, reg = TRUE)
# estimated values
result_pool[k, ] = ATE_Est(mydata$a, mydata$y, mydata$x) - mydata$tau
}
# summarize results from all simulations
summary_mat = rbind(apply(result_pool, 2, mean),
apply(result_pool, 2, sd))
rownames(summary_mat) <- c("estimate", "se")
colnames(summary_mat) <- c("reg", "ipw", "Hajek", "DR")
round(summary_mat, 5)
## reg ipw Hajek DR
## estimate 0.00150 -0.00357 0.01170 0.00563
## se 0.08714 0.35163 0.29993 0.10571
For this case, the data generator uses a nonlinear propensity score, so our working propensity score model is misspecified. The IPW and Hajek estimators can be severely biased here, while the regression and doubly robust estimators remain essentially unbiased.
result_pool = matrix(NA, nsim, 4)
for (k in 1:nsim)
{
mydata = data_gen(n, ps = FALSE, reg = TRUE)
# estimated values
result_pool[k, ] = ATE_Est(mydata$a, mydata$y, mydata$x) - mydata$tau
}
# summarize results from all simulations
summary_mat = rbind(apply(result_pool, 2, mean),
apply(result_pool, 2, sd))
rownames(summary_mat) <- c("estimate", "se")
colnames(summary_mat) <- c("reg", "ipw", "Hajek", "DR")
round(summary_mat, 5)
## reg ipw Hajek DR
## estimate -0.00470 -0.68136 -0.64346 -0.01031
## se 0.10487 0.71156 0.50183 0.18670
In this case, the outcome model in the data generator contains nonlinear terms, so the working regression model is misspecified and the regression estimator is biased. The IPW and Hajek estimators are unaffected since they do not use the outcome model, and the doubly robust estimator remains essentially unbiased because the propensity score model is still correct, although its variance is somewhat larger than in the first case.
result_pool = matrix(NA, nsim, 4)
for (k in 1:nsim)
{
mydata = data_gen(n, ps = TRUE, reg = FALSE)
# estimated values
result_pool[k, ] = ATE_Est(mydata$a, mydata$y, mydata$x) - mydata$tau
}
# summarize results from all simulations
summary_mat = rbind(apply(result_pool, 2, mean),
apply(result_pool, 2, sd))
rownames(summary_mat) <- c("estimate", "se")
colnames(summary_mat) <- c("reg", "ipw", "Hajek", "DR")
round(summary_mat, 5)
## reg ipw Hajek DR
## estimate -0.13398 -0.00158 -0.00435 -0.00228
## se 0.11941 0.31770 0.29824 0.20697
In this final case, both working models are misspecified and all estimators perform poorly. Hence there is a price to pay for relying on double robustness: when both models are wrong, the estimator can in fact be doubly fragile.
result_pool = matrix(NA, nsim, 4)
for (k in 1:nsim)
{
mydata = data_gen(n, ps = FALSE, reg = FALSE)
# estimated values
result_pool[k, ] = ATE_Est(mydata$a, mydata$y, mydata$x) - mydata$tau
}
# summarize results from all simulations
summary_mat = rbind(apply(result_pool, 2, mean),
apply(result_pool, 2, sd))
rownames(summary_mat) <- c("estimate", "se")
colnames(summary_mat) <- c("reg", "ipw", "Hajek", "DR")
round(summary_mat, 5)
## reg ipw Hajek DR
## estimate -0.20039 -0.26808 -0.38877 0.32407
## se 0.16309 0.48331 0.46917 0.51623
There is actually a little catch in our previous discussion of the doubly robust estimator: its nice theoretical properties rely on the propensity score and the outcome model being estimated on data separate from the data used to form the estimator. In other words, it is ideal if the errors of the propensity score and the outcome models are independent of the observations used for estimating the ATE. To make this concrete, in equation (3) we would want the plug-in estimators \(\hat{e}(X)\) and \(\hat{\mu}(X)\) to be obtained from samples independent of \(\{X_i, A_i, Y_i\}_{i=1}^n\). A practical way to achieve this is to split the original sample into parts, fit the propensity score and the outcome models on one part, use the held-out part to perform the AIPW estimation, and then swap the roles and average the results; this procedure is known as cross-fitting. This requirement is a consequence of semiparametric efficiency theory, an advanced statistical topic (Bickel et al. 1993); more details can be found in the textbook of Kosorok (2008).
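Concretely, if we split the sample into \(K\) folds \(\mathcal{I}_1, \ldots, \mathcal{I}_K\) and write \(\hat{e}^{(-k)}\) and \(\hat{\mu}_{(a)}^{(-k)}\) for the nuisance models fitted without fold \(k\), the cross-fitted AIPW estimator takes the form (a sketch of the standard construction)
\[ \hat{\tau}_{\text{cf}} = \frac{1}{n} \sum_{k=1}^{K} \sum_{i \in \mathcal{I}_k} \left( \hat{\mu}_{(1)}^{(-k)}(X_i) - \hat{\mu}_{(0)}^{(-k)}(X_i) + \frac{A_i}{\hat{e}^{(-k)}(X_i)} \left( Y_i - \hat{\mu}_{(1)}^{(-k)}(X_i) \right) - \frac{1 - A_i}{1 - \hat{e}^{(-k)}(X_i)} \left( Y_i - \hat{\mu}_{(0)}^{(-k)}(X_i) \right) \right) \]
Averaging the \(K\) fold-level estimates, as in the code below, gives the same quantity when the folds have equal size.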
The following updated code implements this idea. However, in this example it does not really improve the performance (standard error) much, mainly because \(\hat{e}(X)\) and \(\hat{\mu}(X)\) are simple parametric models and already converge at the root-\(n\) rate. When more flexible machine learning models are used (Chernozhukov et al. 2018), the effect of cross-fitting is more visible. To highlight the results from semiparametric efficiency theory: we only require \(\hat{e}(X)\) and \(\hat{\mu}(X)\) to be \(n^{1/4}\)-consistent for the doubly robust estimator to be root-\(n\) consistent, which is a much less stringent requirement than the root-\(n\) rate typically needed for each model on its own. 1
dr_cross <- function(a, y, x, idx)
{
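# idx: indices of the fold used for estimation; the remaining observations are used to fit the plug-in (nuisance) models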
alldata = data.frame(a, y, x)
data_plugin = alldata[-idx, ]
data_est = alldata[idx, ]
# fitted propensity score
ps_model <- glm(a ~ . - y, data = data_plugin, family = "binomial")
pscore_est <- predict(ps_model, newdata = data_est, type = "response")
# fitted potential outcomes
outcome_model1 <- glm(y ~ . - a, data = data_plugin, weights = a, family = "gaussian")
outcome_est1 <- predict(outcome_model1, newdata = data_est)
outcome_model0 <- glm(y ~ . - a, data = data_plugin, weights = (1 - a), family = "gaussian")
outcome_est0 <- predict(outcome_model0, newdata = data_est)
# doubly robust estimator
res1_est <- data_est$y - outcome_est1
res0_est <- data_est$y - outcome_est0
r_treat_est <- mean(data_est$a * res1_est / pscore_est) # residual correction term for treated
r_control_est <- mean((1 - data_est$a) * res0_est / (1 - pscore_est)) # residual correction term for control
dr_crossfit <- mean(outcome_est1 - outcome_est0) + r_treat_est - r_control_est
# estimated values
return(dr_crossfit)
}
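For instance, a simple two-fold version fits the nuisance models on one half of the sample, estimates on the other half, and then swaps the roles (a quick sketch; tmp and half are just illustrative names, and the simulation below uses a 5-fold version instead):
tmp <- data_gen(n, ps = TRUE, reg = TRUE)
half <- 1:(n / 2)
mean(c(dr_cross(tmp$a, tmp$y, tmp$x, half),
       dr_cross(tmp$a, tmp$y, tmp$x, setdiff(1:n, half))))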
result_drcross = rep(NA, nsim)
for (k in 1:nsim)
{
mydata = data_gen(n, ps = TRUE, reg = TRUE)
# fit the cross-fitted doubly robust estimator
M = 5 # use a 5 fold version
allidx = sample(1:n, n)
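# split the permuted indices into M consecutive chunks, which serve as the folds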
index <- seq_along(allidx)
factor_levels <- cut(index, breaks=M, labels=FALSE)
chunks <- split(allidx, factor_levels)
cross_est_fold = rep(NA, M)
for (j in 1:M)
{
cross_est_fold[j] = dr_cross(mydata$a, mydata$y, mydata$x, chunks[[j]])
}
# estimated values
result_drcross[k] = mean(cross_est_fold) - mydata$tau
}
# summarize results from all simulations
mean(result_drcross)
## [1] 0.01099088
sd(result_drcross)
## [1] 0.117282
In practice, it is often very difficult to derive the variance of an estimator theoretically, since it may be affected by many unknown components. One way to estimate the variance is the bootstrap. It is well known that the bootstrap can be used to approximate the sampling distribution. Mathematically, if we denote the observed data as \(X_1, X_2, \ldots, X_n\), drawn independently from an underlying distribution \(\mathbf{P}\), then we define the empirical distribution of this sample as \(\mathbf{P}_n = n^{-1} \sum_{i = 1}^n \delta_{X_i}\), where \(\delta_{X_i}\) denotes a point mass at \(X_i\). The following is an example of the empirical estimation of the c.d.f. of a univariate variable.
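In particular, the empirical c.d.f. evaluated at a point \(u\) is simply the proportion of observations no larger than \(u\), which is exactly what the code below computes on a grid of \(u\) values:
\[ \hat{F}_n(u) = \mathbf{P}_n \left( (-\infty, u] \right) = \frac{1}{n} \sum_{i=1}^n \mathbf{1}\{ X_i \leq u \} \]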
# obtain some samples from normal
n = 50
x = rnorm(n)
# calculate the empirical cdf over a grid of points
u = seq(-3.5, 3.5, length.out = 200)
mycdf <- function(u, X)
{
cdfe = rep(0, length(u))
for (i in 1:length(u))
cdfe[i] = mean(X <= u[i])
return(cdfe)
}
Fn = mycdf(u, x)
# plot the estimated cdf
plot(u, Fn, type = "l", xlab = "x", ylab = "F(x)")
# the sample mean (x bar) and its 95% normal-approximation confidence interval
mean(x)
## [1] 0.0493282
c(mean(x) - 1.96*sd(x)/sqrt(n), mean(x) + 1.96*sd(x)/sqrt(n))
## [1] -0.2209095 0.3195659
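As a side note, base R provides the same step function through ecdf(); a call such as the following should reproduce essentially the same curve:
plot(ecdf(x), xlab = "x", ylab = "F(x)")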
Since we can expect the empirical distribution \(\mathbf{P}_n\) to converge to the true distribution \(\mathbf{P}\) as \(n\) grows, we can also anticipate that independent draws with replacement from \(\mathbf{P}_n\) (bootstrap samples) behave approximately like draws from \(\mathbf{P}\). An important consequence is that a statistic computed on bootstrap samples approximates the sampling distribution of that statistic under \(\mathbf{P}\), so its variability across bootstrap samples can be used to estimate standard errors and confidence intervals.
Here is an example for estimating the variance of \(\bar{x}\), but using bootstrap samples:
# plot the estimated cdf
plot(u, Fn, type = "l", xlab = "x", ylab = "F(x)", lwd = 2, col = "red")
# bootstrapped means
nbs = 200
xbarb = rep(0, nbs)
for (i in 1:nbs)
{
xb = sample(x, replace = TRUE)
points(u, mycdf(u, xb), type = "s", col = "gray")
xbarb[i] = mean(xb)
}
# bootstrap estimation of 90% confidence interval of xbar
quantile(xbarb, c(0.05, 0.95))
## 5% 95%
## -0.1551600 0.3401761
Using this idea, we can easily obtain a bootstrap estimate of the variance for any of the ATE estimators. The following code applies it to all four estimators (regression, IPW, Hajek, and doubly robust) at once.
n = 500
nbs = 500
mydata = data_gen(n, ps = TRUE, reg = TRUE)
point.est = ATE_Est(mydata$a, mydata$y, mydata$x)
boot.est = replicate(nbs, {
id.boot = sample(1:n, n, replace = TRUE )
ATE_Est(mydata$a[id.boot], mydata$y[id.boot], mydata$x[id.boot, ]) - mydata$tau
})
rbind(point.est,
"sd" = apply(boot.est, 1, sd))
## [,1] [,2] [,3] [,4]
## point.est 0.08127117 0.1065476 0.1349242 0.06750291
## sd 0.10209857 0.2266657 0.2092851 0.11934551
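Based on these bootstrap replicates, we could also report a confidence interval; for example, a normal-approximation 95% interval for the doubly robust estimate (the fourth column above corresponds to the DR estimator) could be computed as
point.est[4] + c(-1.96, 1.96) * sd(boot.est[4, ])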
In fact, a convenient way of implementing cross-fitting is to use the out-of-bag mechanism in random forests. Since a random forest grows each tree on a randomly sampled subset of the data, the prediction for a subject can be formed using only the trees that did not involve that subject. This is called the out-of-bag prediction, and it is (approximately) independent of the subject's own observation.↩︎