Instructions

Please remove this section when submitting your homework.

Students are encouraged to work together on homework and/or utilize advanced AI tools. However, sharing, copying, or providing any part of a homework solution or code to others is an infraction of the University’s rules on Academic Integrity. Any violation will be punished as severely as possible. Final submissions must be uploaded to Gradescope. No email or hard copy will be accepted. For late submission policy and grading rubrics, please refer to the course website.

Question 1: A Simulation Study

This is a simulation study. You are required to design your own simulation data generator and then compare the performance of two estimators: the naive difference-in-means (DIM) estimator and the inverse propensity weighted (IPW) estimator. The two simulation scenarios should satisfy the following:

Briefly explain why the DIM (difference-in-means) estimator is biased in your settings. For both scenarios, replicate the simulation 100 times and compare the mean and sd of the two estimators.
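The comparison above can be sketched as follows. This is only a minimal illustration under assumed settings, not one of the required scenarios: the data generator (a single confounder `x`, a logistic propensity score, a true effect `tau = 1`) and the function names `simulate_once` and `run_study` are hypothetical choices made for this example. The IPW estimator here uses the true propensity score.

```python
import numpy as np

def simulate_once(rng, n=500, tau=1.0):
    # Assumed data generator: one confounder x drives both treatment and outcome.
    x = rng.normal(size=n)
    e = 1.0 / (1.0 + np.exp(-x))            # true propensity score P(A = 1 | X)
    a = rng.binomial(1, e)                  # treatment assignment
    y = tau * a + 2.0 * x + rng.normal(size=n)
    # Naive difference in means between treated and control outcomes.
    dim = y[a == 1].mean() - y[a == 0].mean()
    # Horvitz-Thompson style IPW estimator with the true propensity score.
    ipw = np.mean(a * y / e) - np.mean((1 - a) * y / (1 - e))
    return dim, ipw

def run_study(n_rep=100, seed=0):
    rng = np.random.default_rng(seed)
    results = np.array([simulate_once(rng) for _ in range(n_rep)])
    # Column 0 is DIM, column 1 is IPW.
    return results.mean(axis=0), results.std(axis=0, ddof=1)

means, sds = run_study()
print("DIM mean, sd:", means[0], sds[0])
print("IPW mean, sd:", means[1], sds[1])
```

Because `x` raises both the chance of treatment and the outcome in this assumed generator, the treated group has systematically larger `x`, which is the kind of confounding that the DIM estimator does not adjust for.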

Question 2: Estimated vs. True Propensity Score

In our lecture, we briefly mentioned the property that using the estimated propensity score can actually be better than using the true propensity score. Let’s use a simulation study to see whether this is in fact true. For Scenario 1 in Question 1, use the following setting:

Consider two IPW estimators, one using the true propensity score and the other using the estimated propensity score. Compare the mean and sd of the two estimators over your simulation runs. If you doubt your conclusion, also try a much larger sample size and see whether the conclusion changes.
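One way to set up this comparison is sketched below. The data generator, the sample size, the number of replications, and the helper names `fit_logistic`, `ipw`, and `compare` are all assumptions made for illustration and are not the setting specified for this question. The propensity score is estimated with a hand-written Newton-Raphson logistic regression so that the sketch depends on NumPy only; any standard logistic regression routine could be substituted.

```python
import numpy as np

def fit_logistic(X, a, n_iter=25):
    # Newton-Raphson for logistic regression; X must include an intercept column.
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        w = p * (1.0 - p)
        grad = X.T @ (a - p)                 # score vector
        hess = X.T @ (X * w[:, None])        # observed information matrix
        beta = beta + np.linalg.solve(hess, grad)
    return beta

def ipw(y, a, e):
    # Horvitz-Thompson style IPW estimator for a given propensity vector e.
    return np.mean(a * y / e) - np.mean((1 - a) * y / (1 - e))

def compare(n=500, n_rep=200, tau=1.0, seed=1):
    rng = np.random.default_rng(seed)
    out = np.empty((n_rep, 2))
    for r in range(n_rep):
        # Assumed generator: logistic propensity in x, outcome linear in x.
        x = rng.normal(size=n)
        e_true = 1.0 / (1.0 + np.exp(-0.5 * x))
        a = rng.binomial(1, e_true)
        y = tau * a + 2.0 * x + rng.normal(size=n)
        X = np.column_stack([np.ones(n), x])
        e_hat = 1.0 / (1.0 + np.exp(-X @ fit_logistic(X, a)))
        out[r] = ipw(y, a, e_true), ipw(y, a, e_hat)
    # Column 0 uses the true score, column 1 the estimated score.
    return out.mean(axis=0), out.std(axis=0, ddof=1)

means, sds = compare()
print("true PS      mean, sd:", means[0], sds[0])
print("estimated PS mean, sd:", means[1], sds[1])
```

Rerunning `compare` with a larger `n` is a simple way to check whether the pattern seen at the smaller sample size persists.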

Question 3: Matching Methods

Load the AOD data from the twang package. This dataset contains 600 observations and 5 covariates and was used in McCaffrey et al. (2013). It concerns the effect of substance abuse treatment. The treatment variable is treat and the outcome variable is suf12. Please note that the treat variable has three categories. For this question, we restrict ourselves to observations whose treatment label is either community (traditional programs; consider this the control) or metcbt5 (MET/CBT-5: evidence-based motivational enhancement therapy plus cognitive behavior therapy; consider this the treatment).

In our lecture, we gave an example of matching using the propensity score. In this homework, let’s consider an additional idea, covariate matching: for each treated unit, we compute the Euclidean distance between its covariates and those of every control unit, and select the control unit that is closest. The covariates you should consider are all variables in the dataset other than treat and suf12. Perform the following two matching methods using your own code. Please note that, for both methods, we are only interested in the average treatment effect on the treated (ATT), meaning that we want to estimate the treatment effect for those who received the metcbt5 treatment. Report the estimated ATT from each of the two matching methods.
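The covariate matching idea can be sketched on a tiny hand-made example. This is an illustration only: it does not use the AOD data, the function name `att_nearest_neighbor` is hypothetical, and matching one control to each treated unit with replacement is an assumption of this sketch rather than a requirement of the assignment.

```python
import numpy as np

def att_nearest_neighbor(X, a, y):
    # For each treated unit, find the control unit with the smallest Euclidean
    # distance in covariate space (1-to-1 matching, controls may be reused).
    Xt, Xc = X[a == 1], X[a == 0]
    yt, yc = y[a == 1], y[a == 0]
    # Pairwise squared distances, shape (n_treated, n_control).
    d2 = ((Xt[:, None, :] - Xc[None, :, :]) ** 2).sum(axis=2)
    nearest = d2.argmin(axis=1)
    # ATT estimate: average treated outcome minus matched control outcome.
    return np.mean(yt - yc[nearest])

# Toy data: two treated units followed by three control units.
X = np.array([[0.0, 0.0], [1.0, 1.0], [0.1, 0.0], [0.9, 1.2], [5.0, 5.0]])
a = np.array([1, 1, 0, 0, 0])
y = np.array([3.0, 5.0, 2.0, 4.0, 10.0])

att = att_nearest_neighbor(X, a, y)
print(att)
```

In the toy data, the first treated unit is matched to the control at (0.1, 0.0) and the second to the control at (0.9, 1.2); the distant control at (5.0, 5.0) is never used.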

For covariate matching, discuss which properties of the data could significantly affect the performance of the matching method. What could you do to mitigate these issues? You do not have to implement your suggestions.

Question 4: Invariance of IPW Estimator

The IPW estimator (Horvitz–Thompson estimator) is not location invariant: if we add a constant c to every outcome \(Y_i\), the IPW estimate changes. Show that the IPW estimator is not location invariant. To address this issue, a new estimator called the Hajek estimator was proposed:

\(\hat{\tau}_{\text{hajek}} = \frac{\sum_{i=1}^n \frac{A_iY_i}{\hat{e}_i}}{\sum_{i=1}^n \frac{A_i}{\hat{e}_i}} - \frac{\sum_{i=1}^n \frac{(1-A_i)Y_i}{1-\hat{e}_i}}{\sum_{i=1}^n \frac{(1-A_i)}{1-\hat{e}_i}}\)

Prove that the Hajek estimator is location invariant, meaning that when a constant c is added to all outcomes, the Hajek estimator does not change. Calculate this estimator based on your data in Question 2. Report the mean and sd of the estimator over your simulation runs.
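A numerical check can complement the proof, although it does not replace it. The sketch below uses an assumed toy data generator and hypothetical function names `ht` and `hajek`, and for simplicity plugs in the true propensity score in place of \(\hat{e}_i\). It shifts every outcome by c and records how much each estimate moves.

```python
import numpy as np

def ht(y, a, e):
    # Horvitz-Thompson IPW estimator.
    return np.mean(a * y / e) - np.mean((1 - a) * y / (1 - e))

def hajek(y, a, e):
    # Hajek estimator: weights are normalized to sum to one within each arm.
    w1, w0 = a / e, (1 - a) / (1 - e)
    return np.sum(w1 * y) / np.sum(w1) - np.sum(w0 * y) / np.sum(w0)

# Assumed toy data; any outcomes, treatments, and propensities in (0, 1) would do.
rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)
e = 1.0 / (1.0 + np.exp(-x))
a = rng.binomial(1, e)
y = a + x + rng.normal(size=n)

c = 10.0
ht_shift = ht(y + c, a, e) - ht(y, a, e)
hajek_shift = hajek(y + c, a, e) - hajek(y, a, e)
print("HT shift:   ", ht_shift)
print("Hajek shift:", hajek_shift)
```

The Horvitz-Thompson shift equals c times the difference between the average treated weight and the average control weight, which is generally nonzero in a finite sample, while the normalized weights in the Hajek estimator make the shift cancel up to floating-point error.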