This is a simulation study. You are required to design your own simulation data generator and then compare the performance of two estimators: the naive difference-in-means estimator and the inverse-propensity weighted (IPW) estimator. The two simulation scenarios should satisfy the following:
Briefly explain why the DID (difference-in-means) estimator is biased in your settings. For both scenarios, replicate the simulation 100 times and compare the mean and sd of the two estimators.
For Scenario 1, we assume \(Y(1)\sim N(0.5X+X^2,1)\), and \(Y(0)\sim N(0.5X,1)\), where \(X\sim N(0,1)\). Thus, the true ATE can be calculated as \(\tau = \mathbb{E}[Y(1)-Y(0)]=\mathbb{E}[X^2]=1\). We assign the treatment with a Bernoulli draw, \(P(A=1\mid X)=\frac{1}{1+e^X}\).
The DID estimator is biased in this setting because \(X\) is a confounder: units with higher \(X\) are more likely to receive treatment 0. The treated group therefore has systematically lower \(X\), so the \(0.5X\) term pulls the treated-group mean down and the control-group mean up. As a result, the DID estimator gives a result smaller than the true treatment effect.
The IPW estimator shows a smaller bias but a larger variance compared with the DID estimator.
n <- 200
tau <- 1 # true ATE
nsim <- 100
set.seed(11)
tauhat_did <- rep(NA, nsim)
tauhat_ipw <- rep(NA, nsim)
for (i in 1:nsim) {
  # generate potential outcomes
  X <- rnorm(n)
  Y1 <- rnorm(n, mean = 0.5 * X + X^2)
  Y0 <- rnorm(n, mean = 0.5 * X)
  # treatment label: P(A = 1 | X) = 1 / (1 + exp(X))
  A <- rbinom(n, 1, 1 / (1 + exp(X)))
  # observed outcomes
  Y <- A * Y1 + (1 - A) * Y0
  tauhat_did[i] <- mean(Y[A == 1]) - mean(Y[A == 0])
  # IPW weights: 1/e = 1 + exp(X), 1/(1 - e) = (1 + exp(X)) / exp(X)
  tauhat_ipw[i] <- mean(A * Y * (1 + exp(X)) - (1 - A) * Y * ((1 + exp(X)) / exp(X)))
}
par(mfrow=c(1,2))
hist(tauhat_did)
hist(tauhat_ipw)
mean(tauhat_did)
## [1] 0.5888208
sd(tauhat_did)
## [1] 0.1879875
mean(tauhat_ipw)
## [1] 1.015779
sd(tauhat_ipw)
## [1] 0.3716539
For Scenario 2, similarly, we assume \(Y(1)\sim N(0.5X+X^2,1)\), and \(Y(0)\sim N(0.5X,1)\), where \(X\in \{-1,1\}\) with equal probabilities. Thus, the true ATE can be calculated as \(\tau = \mathbb{E}[Y(1)-Y(0)]=\mathbb{E}[X^2]=1\). We assign the treatment with a Bernoulli draw, \(P(A=1\mid X)=\frac{e^X}{1+e^X}\).
The DID estimator is biased in this setting because units with higher \(X\) are more likely to receive treatment 1, so \(X\) confounds the comparison: the \(0.5X\) term inflates the treated-group mean and deflates the control-group mean. Thus, the DID estimator gives a result larger than the true treatment effect.
The IPW estimator shows a smaller bias and a comparable variance compared with the DID estimator.
n <- 200
tau <- 1 # true ATE
nsim <- 100
set.seed(11)
tauhat_did <- rep(NA, nsim)
tauhat_ipw <- rep(NA, nsim)
for (i in 1:nsim) {
  # generate potential outcomes (draw n covariates, not nsim)
  X <- sample(c(-1, 1), n, replace = TRUE)
  Y1 <- rnorm(n, mean = 0.5 * X + X^2)
  Y0 <- rnorm(n, mean = 0.5 * X)
  # treatment label: P(A = 1 | X) = exp(X) / (1 + exp(X))
  A <- rbinom(n, 1, exp(X) / (1 + exp(X)))
  # observed outcomes
  Y <- A * Y1 + (1 - A) * Y0
  tauhat_did[i] <- mean(Y[A == 1]) - mean(Y[A == 0])
  # IPW weights: 1/e = (1 + exp(X)) / exp(X), 1/(1 - e) = 1 + exp(X)
  tauhat_ipw[i] <- mean(A * Y * (1 + exp(X)) / exp(X) - (1 - A) * Y * (1 + exp(X)))
}
par(mfrow=c(1,2))
hist(tauhat_did)
hist(tauhat_ipw)
mean(tauhat_did)
## [1] 1.490535
sd(tauhat_did)
## [1] 0.1610187
mean(tauhat_ipw)
## [1] 1.025146
sd(tauhat_ipw)
## [1] 0.2078914
In our lecture, we briefly mentioned a property that using the estimated propensity score can actually be better than using its theoretical truth. Let’s use a simulation study to see if this is in fact true. For Scenario 1 in Question 1, use the following setting:
Consider two IPW estimators, one using the true propensity score and the other using the estimated propensity score. Compare the mean and sd of the two estimators over your simulation runs. If you doubt your conclusion, also try a much larger sample size and see if the conclusion changes.
n <- 50
tau <- 1 # true ATE
nsim <- 1000
set.seed(11)
tauhat_ipw <- rep(NA, nsim)
tauhat_ipw_est <- rep(NA, nsim)
for (i in 1:nsim) {
  # generate potential outcomes
  X <- rnorm(n)
  Y1 <- rnorm(n, mean = 0.5 * X + X^2)
  Y0 <- rnorm(n, mean = 0.5 * X)
  # treatment label
  A <- rbinom(n, 1, 1 / (1 + exp(X)))
  # observed outcomes
  Y <- A * Y1 + (1 - A) * Y0
  # IPW with the true propensity score
  tauhat_ipw[i] <- mean(A * Y * (1 + exp(X)) - (1 - A) * Y * ((1 + exp(X)) / exp(X)))
  # IPW with the propensity score estimated by logistic regression
  mod <- glm(A ~ X, family = "binomial")
  prob <- predict(mod, type = "response")
  tauhat_ipw_est[i] <- mean(A * Y * (1 / prob) - (1 - A) * Y * (1 / (1 - prob)))
}
par(mfrow=c(1,2))
hist(tauhat_ipw)
hist(tauhat_ipw_est)
mean(tauhat_ipw)
## [1] 0.9974602
sd(tauhat_ipw)
## [1] 0.7362435
mean(tauhat_ipw_est)
## [1] 0.9249151
sd(tauhat_ipw_est)
## [1] 0.6433434
What we can see here is that using the estimated propensity score gives a smaller variance than using the true propensity score, although it is slightly biased. If we increase the sample size to 500, this bias is reduced, and the estimated version still has a smaller sd.
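To support the last claim, here is a sketch that re-runs the same comparison with a larger sample size, \(n = 500\) (same data-generating process and estimators as the chunk above; output not shown here):

```r
# Re-run the true-vs-estimated propensity score comparison with n = 500
n <- 500
nsim <- 1000
set.seed(11)
tauhat_ipw <- rep(NA, nsim)
tauhat_ipw_est <- rep(NA, nsim)
for (i in 1:nsim) {
  X <- rnorm(n)
  Y1 <- rnorm(n, mean = 0.5 * X + X^2)
  Y0 <- rnorm(n, mean = 0.5 * X)
  A <- rbinom(n, 1, 1 / (1 + exp(X)))
  Y <- A * Y1 + (1 - A) * Y0
  # true propensity score
  tauhat_ipw[i] <- mean(A * Y * (1 + exp(X)) - (1 - A) * Y * ((1 + exp(X)) / exp(X)))
  # estimated propensity score
  mod <- glm(A ~ X, family = "binomial")
  prob <- predict(mod, type = "response")
  tauhat_ipw_est[i] <- mean(A * Y / prob - (1 - A) * Y / (1 - prob))
}
c(mean(tauhat_ipw), sd(tauhat_ipw))
c(mean(tauhat_ipw_est), sd(tauhat_ipw_est))
```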
Load the AOD data from the twang package. This data contains 600 observations and 5 covariates. The data was used in McCaffrey et al. (2013) and concerns the treatment effect of substance abuse treatment. The treatment variable is treat and the outcome variable is suf12. Please note that the treat variable has three categories. For this question, we will restrict ourselves to the data that received community (traditional programs; consider this as the control) or metcbt5 (MET/CBT-5: evidence-based motivational enhancement therapy plus cognitive behavior therapy) as the treatment label. In our lecture, we gave an example of matching using the propensity score. In this homework, let’s consider an additional idea, covariate matching: we consider the Euclidean distance between the covariates of the treated and control groups, and select the control that is closest. The covariates you should consider are anything besides treat and suf12 in the dataset. Perform the following two matching methods using your own code. Please note that, for both methods, we are only interested in the average treatment effect of the treated (ATT), meaning that we want to estimate the treatment effect of those who received the metcbt5 treatment. Report the results of the two matching methods in terms of the estimated ATE.
1. For each observation that received the metcbt5 treatment, find a matched observation that received the community treatment using the propensity score. Based on your matching, estimate the ATE using the difference-in-means estimator.
2. For each observation that received the metcbt5 treatment, find a matched observation that received community using the Euclidean distance of the covariates. Based on your matching, estimate the ATE using the difference-in-means estimator.
Discuss, for the covariate matching, what properties of the data could significantly affect the performance of the matching method. What can you do to mitigate these issues? You do not have to implement them.
library(twang)
## To reproduce results from prior versions of the twang package, please see the version="legacy" option described in the documentation.
data(AOD)
head(AOD)
# subset data to the community and metcbt5 groups
df <- AOD[AOD$treat == "community" | AOD$treat == "metcbt5", ]
# logistic regression for propensity score
mod <- glm(treat ~ . - suf12, data = df, family = binomial)
summary(mod)
##
## Call:
## glm(formula = treat ~ . - suf12, family = binomial, data = df)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.41532 -1.16973 -0.00695 1.17409 1.39759
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.04850 0.11115 -0.436 0.663
## illact -0.08160 0.09732 -0.839 0.402
## crimjust 0.09270 0.09701 0.956 0.339
## subprob 0.09144 0.10175 0.899 0.369
## subdep 0.01317 0.09506 0.139 0.890
## white 0.30582 0.26368 1.160 0.246
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 554.52 on 399 degrees of freedom
## Residual deviance: 550.83 on 394 degrees of freedom
## AIC: 562.83
##
## Number of Fisher Scoring iterations: 4
df$propensity_score <- predict(mod, type = "response")
df0 <- df[df$treat == "community", ]
df1 <- df[df$treat == "metcbt5", ]
matched_idx <- rep(NA, nrow(df1))
# getting index of matched control samples (matching with replacement)
for (i in 1:nrow(df1)) {
  diff <- abs(df1$propensity_score[i] - df0$propensity_score)
  matched_idx[i] <- which.min(diff)
}
# calculate the ATT via difference in means
mean(df1$suf12 - df0$suf12[matched_idx])
## [1] 0.0931482
The estimated treatment effect of the treated is 0.0931.
x0 <- data.matrix(AOD[AOD$treat == "community", -c(1, 2)])
x1 <- data.matrix(AOD[AOD$treat == "metcbt5", -c(1, 2)])
matched_idx <- rep(NA, nrow(x1))
for (i in 1:nrow(x1)) {
  # squared Euclidean distance from treated observation i to every control
  diff <- sweep(x0, 2, STATS = x1[i, ], FUN = "-")
  dist <- rowSums(diff^2)
  matched_idx[i] <- which.min(dist)
}
# calculate the ATT via difference in means
mean(df1$suf12 - df0$suf12[matched_idx])
## [1] 0.1775917
The two methods perform similarly and both give a slightly positive ATT. The covariate matching method is sensitive to the scales of the covariates: if the scales differ, the distance will be dominated by the covariate with the largest scale. To mitigate this issue, we can standardize the covariates before matching.
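Although the question does not require implementing the mitigation, a minimal sketch of the standardization idea follows, reusing the x0, x1, df0, and df1 objects defined above. Each covariate is centered and scaled using the pooled sample before computing distances:

```r
# Standardize covariates (mean 0, sd 1) over the pooled sample so that no
# covariate dominates the Euclidean distance simply because of its scale
xall <- rbind(x0, x1)
ctr <- colMeans(xall)
scl <- apply(xall, 2, sd)
x0s <- scale(x0, center = ctr, scale = scl)
x1s <- scale(x1, center = ctr, scale = scl)
matched_idx_s <- rep(NA, nrow(x1s))
for (i in 1:nrow(x1s)) {
  diff <- sweep(x0s, 2, STATS = x1s[i, ], FUN = "-")
  matched_idx_s[i] <- which.min(rowSums(diff^2))
}
# ATT after matching on standardized covariates
mean(df1$suf12 - df0$suf12[matched_idx_s])
```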
The IPW estimator (Horvitz–Thompson estimator) is not location invariant: if we add a constant \(c\) to all observations, the IPW estimator will change. Show that the IPW estimator is not location invariant. To address this issue, a new estimator, called the Hajek estimator, was proposed:
\[ \hat{\tau}_{\text{hajek}} = \frac{\sum_{i=1}^n \frac{A_iY_i}{\hat{e}_i}}{\sum_{i=1}^n \frac{A_i}{\hat{e}_i}} - \frac{\sum_{i=1}^n \frac{(1-A_i)Y_i}{1-\hat{e}_i}}{\sum_{i=1}^n \frac{(1-A_i)}{1-\hat{e}_i}} \] Prove that the Hajek estimator is location invariant, meaning that when adding a constant \(c\) to all observations, the Hajek estimator will not change. Calculate this estimator based on your data in Question 2. Report the mean and sd of the estimator over your simulation runs.
For the IPW estimator, we have \[ \hat \tau_0 = \frac{1}{n}\sum_{i=1}^n\left[\frac{A_iY_i}{\hat e(X_i)} - \frac{(1 - A_i)Y_i}{1 - \hat e(X_i)}\right] \]
After adding a constant \(c\) to all values of \(Y_i\), making it \(Y_i' = Y_i + c\), the modified IPW estimator becomes: \[\begin{align*} \hat \tau_1 & = \frac{1}{n}\sum_{i=1}^n\left[\frac{A_i(Y_i+c)}{\hat e(X_i)} - \frac{(Y_i+c)(1 - A_i)}{1 - \hat e(X_i)}\right] \\ & = \frac{1}{n}\sum_{i=1}^n \frac{A_i(Y_i + c) (1-\hat e(X_i)) - (1-A_i)(Y_i+c)\hat e(X_i) }{\hat e(X_i) (1-\hat e(X_i))} \end{align*}\]
The difference between the original and modified IPW estimators, which shows the effect of adding \(c\), is: \[ \hat \tau_0-\hat \tau_1= \frac{1}{n}\sum_{i=1}^n \frac{c\,(\hat e(X_i)-A_i)}{\hat e(X_i) (1-\hat e(X_i))} \neq 0 \]
This difference demonstrates that the IPW estimator is not location invariant: adding a constant \(c\) to all outcome values \(Y_i\) alters the estimator’s value through a term that depends on the treatment indicator \(A_i\) and the propensity score \(\hat e(X_i)\).
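As a quick numerical sanity check of this derivation, we can verify on a toy simulated dataset (the data-generating process below is chosen only for illustration) that shifting every outcome by \(c\) changes the IPW estimate by exactly the term derived above:

```r
# Toy check: shifting all outcomes by c changes the IPW estimate by
# (1/n) * sum(c * (A - e) / (e * (1 - e))), matching the derivation
set.seed(1)
n <- 1000
X <- rnorm(n)
e <- 1 / (1 + exp(X))          # true propensity score
A <- rbinom(n, 1, e)
Y <- rnorm(n, mean = 0.5 * X + A * X^2)
ipw <- function(y) mean(A * y / e - (1 - A) * y / (1 - e))
c_shift <- 10
shift_observed <- ipw(Y + c_shift) - ipw(Y)
shift_formula <- mean(c_shift * (A - e) / (e * (1 - e)))
all.equal(shift_observed, shift_formula)   # should be TRUE
```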
For the Hajek estimator, \[ \hat{\tau}_0 = \frac{\sum_{i=1}^n \frac{A_iY_i}{\hat{e}_i}}{\sum_{i=1}^n \frac{A_i}{\hat{e}_i}} - \frac{\sum_{i=1}^n \frac{(1-A_i)Y_i}{1-\hat{e}_i}}{\sum_{i=1}^n \frac{(1-A_i)}{1-\hat{e}_i}} \]
After adding a constant \(c\) to all values of \(Y_i\), making it \(Y_i' = Y_i + c\), the modified Hajek estimator becomes: \[\begin{align*} \hat \tau_1 & = \frac{\sum_{i=1}^n \frac{A_i(Y_i+c)}{\hat{e}_i}}{\sum_{i=1}^n \frac{A_i}{\hat{e}_i}} - \frac{\sum_{i=1}^n \frac{(1-A_i)(Y_i+c)}{1-\hat{e}_i}}{\sum_{i=1}^n \frac{(1-A_i)}{1-\hat{e}_i}} \\ & = \frac{\sum_{i=1}^n \frac{A_i(Y_i+c)}{\hat{e}_i}\sum_{i=1}^n\frac{(1-A_i)}{1-\hat{e}_i} - \sum_{i=1}^n \frac{(1-A_i)(Y_i+c)}{1-\hat{e}_i}\sum_{i=1}^n\frac{A_i}{\hat{e}_i} }{(\sum_{i=1}^n \frac{A_i}{\hat{e}_i})(\sum_{i=1}^n \frac{(1-A_i)}{1-\hat{e}_i})} \\ & = \frac{\sum_{i=1}^n \frac{A_iY_i}{\hat{e}_i}\sum_{i=1}^n\frac{(1-A_i)}{1-\hat{e}_i} - \sum_{i=1}^n \frac{(1-A_i)Y_i}{1-\hat{e}_i}\sum_{i=1}^n\frac{A_i}{\hat{e}_i} }{(\sum_{i=1}^n \frac{A_i}{\hat{e}_i})(\sum_{i=1}^n \frac{(1-A_i)}{1-\hat{e}_i})} \qquad (=\hat \tau_0)\\ \qquad & +\frac{c\sum_{i=1}^n \frac{A_i}{\hat{e}_i}\sum_{i=1}^n\frac{(1-A_i)}{1-\hat{e}_i}-c\sum_{i=1}^n \frac{(1-A_i)}{1-\hat{e}_i}\sum_{i=1}^n\frac{A_i}{\hat{e}_i}}{(\sum_{i=1}^n \frac{A_i}{\hat{e}_i})(\sum_{i=1}^n \frac{(1-A_i)}{1-\hat{e}_i})} \qquad (=0)\\ & = \hat \tau_0 \end{align*}\]
Therefore, the Hajek estimator is location invariant.
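The question also asks to compute the Hajek estimator on the data from Question 2. A sketch follows, re-generating the same Scenario 1 data (\(n = 50\), 1000 replications) with the estimated propensity score and reporting the mean and sd over the simulation runs:

```r
# Hajek estimator over the same simulation setting as Question 2
n <- 50
nsim <- 1000
set.seed(11)
tauhat_hajek <- rep(NA, nsim)
for (i in 1:nsim) {
  X <- rnorm(n)
  Y1 <- rnorm(n, mean = 0.5 * X + X^2)
  Y0 <- rnorm(n, mean = 0.5 * X)
  A <- rbinom(n, 1, 1 / (1 + exp(X)))
  Y <- A * Y1 + (1 - A) * Y0
  # estimated propensity score via logistic regression
  mod <- glm(A ~ X, family = "binomial")
  ehat <- predict(mod, type = "response")
  # Hajek: normalize each arm's IPW weights so they sum to one
  w1 <- A / ehat
  w0 <- (1 - A) / (1 - ehat)
  tauhat_hajek[i] <- sum(w1 * Y) / sum(w1) - sum(w0 * Y) / sum(w0)
}
c(mean(tauhat_hajek), sd(tauhat_hajek))
```

Because the weights in each arm are normalized to sum to one, adding a constant \(c\) to all outcomes shifts both terms by exactly \(c\), leaving the estimate unchanged, consistent with the proof above.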