Please remove this section when submitting your homework.
Students are encouraged to work together on homework and/or utilize advanced AI tools. However, sharing, copying, or providing any part of a homework solution or code to others is an infraction of the University’s rules on Academic Integrity. Any violation will be punished as severely as possible. Final submissions must be uploaded to Gradescope. No email or hard copy will be accepted. For late submission policy and grading rubrics, please refer to the course website.
Submit your report as HWx_yourNetID.pdf. For example, HW01_rqzhu.pdf. Please note that this must be a .pdf file; .html format cannot be accepted. Make all of your R code chunks visible for grading.
If you use the .Rmd file as a template, be sure to remove this instruction section.
Make sure that your R version is \(\geq 4.0.0\). This will ensure your random seed generation is the same as everyone else's. Please note that updating the R version
may require you to re-install all of your packages.

This is a simulation study. You are required to design your own simulation data generator and then compare the performance of two estimators: the naive difference-in-means estimator and the inverse-propensity weighted estimator. The two simulation scenarios should satisfy the following:
Briefly explain why the DIM (difference-in-means) estimator is biased in your settings. For both scenarios, replicate the simulation 100 times and compare the mean and sd of the two estimators.
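As a rough illustration of the comparison described above, the following sketch uses a hypothetical data generator (a single confounder `x`, a logistic propensity score, and a constant treatment effect `tau`). None of these choices come from the assignment; they are assumptions made only to show the mechanics of replicating a simulation and summarizing the two estimators.

```r
# Hypothetical data generator (an assumption, not the assignment's scenario):
# one confounder x drives both the treatment assignment and the outcome.
simulate_once <- function(n = 500, tau = 1) {
  x <- rnorm(n)
  e <- plogis(0.8 * x)               # true propensity score P(A = 1 | X)
  a <- rbinom(n, 1, e)               # treatment indicator
  y <- tau * a + 2 * x + rnorm(n)    # true average treatment effect is tau

  # Naive difference-in-means estimator
  dim_est <- mean(y[a == 1]) - mean(y[a == 0])
  # Inverse-propensity weighted estimator using the true propensity score
  ipw_est <- mean(a * y / e) - mean((1 - a) * y / (1 - e))

  c(dim = dim_est, ipw = ipw_est)
}

set.seed(1)
res <- replicate(100, simulate_once())   # 2 x 100 matrix, one column per run

# Mean and standard deviation of each estimator over the 100 runs
summary_tab <- rbind(mean = rowMeans(res), sd = apply(res, 1, sd))
print(round(summary_tab, 3))
```

In this hypothetical setting, treated units tend to have larger `x`, so the difference in means mixes the treatment effect with the effect of `x` on the outcome.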
In our lecture, we briefly mentioned a property that using the estimated propensity score can actually be better than using its theoretical truth. Let’s use a simulation study to see if this is in fact true. For Scenario 1 in Question 1, use the following setting:
Consider two IPW estimators, one using the true propensity score and the other using the estimated propensity score. Compare the mean and sd of the two estimators over your simulation runs. If you doubt your conclusion, also try a much larger sample size, and see if the conclusion changes.
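One possible way to set up this comparison is sketched below. The data generator, the helper names `ipw` and `simulate_ps_once`, and the logistic regression used to estimate the propensity score are all assumptions for illustration; the assignment's own setting should replace them.

```r
# IPW estimator for a given vector of propensity scores e
ipw <- function(y, a, e) mean(a * y / e) - mean((1 - a) * y / (1 - e))

# Hypothetical data generator (an assumption, not the assignment's setting)
simulate_ps_once <- function(n = 500, tau = 1) {
  x <- rnorm(n)
  e_true <- plogis(0.8 * x)                 # true propensity score
  a <- rbinom(n, 1, e_true)
  y <- tau * a + 2 * x + rnorm(n)

  # Estimated propensity score from a correctly specified logistic regression
  e_hat <- fitted(glm(a ~ x, family = binomial()))

  c(true_ps = ipw(y, a, e_true), est_ps = ipw(y, a, e_hat))
}

set.seed(2)
res_ps <- replicate(200, simulate_ps_once())

# Compare the mean and sd of the two IPW estimators over the runs
ps_tab <- rbind(mean = rowMeans(res_ps), sd = apply(res_ps, 1, sd))
print(round(ps_tab, 3))
```

Changing `n` in the call, for example `simulate_ps_once(n = 5000)`, is one way to check whether the comparison holds at a larger sample size.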
Load the AOD data from the twang package. This data contains 600 observations and 5 covariates. The data was used in McCaffrey et al. (2013) and concerns the treatment effect of substance abuse treatment. The treatment variable is treat and the outcome variable is suf12. Please note that the treat variable has three categories. For this question, we will restrict ourselves to the observations whose treatment label is community (traditional programs; consider this as the control) or metcbt5 (MET/CBT-5: evidence-based motivational enhancement therapy plus cognitive behavior therapy). In our lecture, we gave an example of matching using the propensity score. In this homework, let’s consider an additional idea, covariate matching: we compute the Euclidean distance between the covariates of treated and control observations and select the control that is closest. The covariates you should consider are everything besides treat and suf12 in the dataset. Perform the following two matching methods using your own code. Please note that, for both methods, we are only interested in the average treatment effect of the treated (ATT), meaning that we want to estimate the treatment effect for those who received the metcbt5 treatment. Report the estimated ATT from both matching methods.
Match each observation that received the metcbt5 treatment with an observation that received the community treatment using the propensity score. Based on your matching, estimate the ATT using the difference-in-means estimator.
For each observation that received the metcbt5 treatment, find a matched observation that received community using the Euclidean distance of the covariates. Based on your matching, estimate the ATT using the difference-in-means estimator.
Discuss what properties of the data could significantly affect the performance of covariate matching. What can you do to mitigate these issues? You do not have to implement them.
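The two matching methods share the same mechanics: for each treated unit, find the closest control under some distance, then average the outcome differences. The sketch below shows this on synthetic data rather than on AOD, so it runs without the twang package. The helper `match_att`, the synthetic covariates, and the choice of 1-to-1 nearest-neighbor matching with replacement are assumptions for illustration, not the method prescribed in lecture.

```r
# Nearest-neighbor matching with replacement (hypothetical helper).
# `score` is a vector (e.g. a propensity score) or a matrix of covariates;
# each treated unit is matched to the control with the smallest squared
# Euclidean distance in `score`. Returns the matched difference-in-means (ATT).
match_att <- function(y, a, score) {
  score <- as.matrix(score)
  treated <- which(a == 1)
  control <- which(a == 0)
  matched <- vapply(treated, function(i) {
    diffs <- sweep(score[control, , drop = FALSE], 2, score[i, ])
    control[which.min(rowSums(diffs^2))]
  }, integer(1))
  mean(y[treated] - y[matched])
}

# Synthetic data (an assumption; replace with the AOD subset in practice)
set.seed(3)
n <- 400
X <- cbind(x1 = rnorm(n), x2 = rnorm(n))
a <- rbinom(n, 1, plogis(0.5 * X[, "x1"] - 0.5 * X[, "x2"]))
y <- 1 * a + X[, "x1"] + X[, "x2"] + rnorm(n)   # effect is 1 for every unit

ps_hat  <- fitted(glm(a ~ X, family = binomial()))  # estimated propensity score
att_ps  <- match_att(y, a, ps_hat)                  # propensity score matching
att_cov <- match_att(y, a, X)                       # covariate matching
print(c(propensity = att_ps, covariate = att_cov))
```

Passing a one-column score or a multi-column covariate matrix to the same function keeps the two methods directly comparable, since only the distance input changes.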
The IPW estimator (Horvitz–Thompson estimator) is not location invariant: if we add a constant \(c\) to all observations, the IPW estimator will change. Show that the IPW estimator is not location invariant. To address this issue, a new estimator called the Hajek estimator was proposed:
\(\hat{\tau}_{\text{hajek}} = \frac{\sum_{i=1}^n \frac{A_iY_i}{\hat{e}_i}}{\sum_{i=1}^n \frac{A_i}{\hat{e}_i}} - \frac{\sum_{i=1}^n \frac{(1-A_i)Y_i}{1-\hat{e}_i}}{\sum_{i=1}^n \frac{(1-A_i)}{1-\hat{e}_i}}\)
Prove that the Hajek estimator is location invariant, meaning that adding a constant \(c\) to all observations does not change the Hajek estimator. Calculate this estimator based on your data in Question 2. Report the mean and sd of the estimator over your simulation runs.
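A numerical check can complement the proof. The sketch below computes both estimators on one hypothetical dataset, before and after shifting every outcome by a constant. The data generator and the function names `ipw_ht` and `hajek` are assumptions, and the true propensity score `e` stands in for \(\hat{e}_i\) to keep the example short.

```r
# Horvitz-Thompson style IPW estimator
ipw_ht <- function(y, a, e) mean(a * y / e) - mean((1 - a) * y / (1 - e))

# Hajek estimator: weights are normalized to sum to one within each arm
hajek <- function(y, a, e) {
  w1 <- a / e
  w0 <- (1 - a) / (1 - e)
  sum(w1 * y) / sum(w1) - sum(w0 * y) / sum(w0)
}

# Hypothetical data (an assumption, not the assignment's scenario)
set.seed(4)
n <- 500
x <- rnorm(n)
e <- plogis(0.8 * x)
a <- rbinom(n, 1, e)
y <- a + 2 * x + rnorm(n)

shift <- 10   # constant added to every outcome

ht_change    <- ipw_ht(y + shift, a, e) - ipw_ht(y, a, e)
hajek_change <- hajek(y + shift, a, e) - hajek(y, a, e)
print(c(ht_change = ht_change, hajek_change = hajek_change))
```

The change in the Horvitz–Thompson estimate equals `shift * (mean(a / e) - mean((1 - a) / (1 - e)))`, which is generally nonzero in a finite sample, while the change in the Hajek estimate is zero up to floating-point error.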