STAT 546 Homework 1

Instruction
Question 1: A Simulation Study
Question 2: Estimated vs. True Propensity Score
Question 3: Matching Methods
Question 4: Invariance of IPW Estimator

Instruction

Please remove this section when submitting your homework.

Students are encouraged to work together on homework and/or utilize advanced AI tools. However, sharing, copying, or providing any part of a homework solution or code to others is an infraction of the University’s rules on Academic Integrity. Any violation will be punished as severely as possible. Final submissions must be uploaded to Gradescope. No email or hard copy will be accepted. For late submission policy and grading rubrics, please refer to the course website.

You are required to submit the rendered file HWx_yourNetID.pdf. For example, HW01_rqzhu.pdf. Please note that this must be a .pdf file. .html format cannot be accepted. Make all of your R code chunks visible for grading.
Include your Name and NetID in the report.
If you use this file or the example homework .Rmd file as a template, be sure to remove this instruction section.
Make sure that you set seed properly so that the results can be replicated if needed.
For some questions, there will be restrictions on what packages/functions you can use. Please read the requirements carefully. As long as the question does not specify such restrictions, you can use anything.
When using AI tools, you are encouraged to document or comment on your experience with AI tools especially when your main work is taken directly from them.
On random seed and reproducibility: Make sure your R version is \(\geq 4.0.0\). This will ensure that your random number generation is the same as everyone else's. Please note that updating R may require you to re-install all of your packages.
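
As a minimal sketch of the seed requirement above, a setup chunk could check the R version and fix the seed once. The seed value and the variable names `draw_a` and `draw_b` are arbitrary choices made for this illustration, not course requirements.

```r
# Hypothetical setup chunk: stop early if the R version is older than 4.0.0.
stopifnot(getRversion() >= "4.0.0")

set.seed(546)        # arbitrary seed value chosen for this sketch
draw_a <- runif(3)

set.seed(546)        # resetting the same seed replays the same random stream
draw_b <- runif(3)

identical(draw_a, draw_b)
```

Setting the seed once at the top of the document is usually enough for a rendered report, since the chunks then run in a fixed order.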

Question 1: A Simulation Study

This is a simulation study. You are required to design your own simulation data generator and then compare the performance of two estimators: the naive difference-in-means estimator and the inverse-propensity weighted (IPW) estimator. Design two simulation scenarios that satisfy the following:

Scenario 1: The naive difference-in-means estimator is biased downward compared with the true ATE, in the presence of a continuous confounder.
Scenario 2: The naive difference-in-means estimator is biased upward compared with the true ATE, in the presence of a discrete confounder.

Briefly explain why the DIM (difference-in-means) estimator is biased in your settings. For both scenarios, replicate the simulation 100 times and compare the mean and sd of the two estimators.
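
The two estimators compared above can be sketched as follows. This is only an illustration under an assumed data-generating model (the coefficients, the sample size, and the helper names `dim_est` and `ipw_est` are hypothetical). It is not one of the required scenarios, which you still need to design yourself.

```r
# Sketch only: one possible confounded design, not a required scenario.
set.seed(1)
n <- 2000
x <- rnorm(n)                    # continuous confounder (assumed design)
e <- plogis(0.8 * x)             # true propensity score P(A = 1 | X)
a <- rbinom(n, 1, e)             # treatment indicator
y <- 1 * a + 2 * x + rnorm(n)    # outcome; the true ATE is 1 by construction

# Naive difference-in-means estimator
dim_est <- function(y, a) mean(y[a == 1]) - mean(y[a == 0])

# Horvitz-Thompson style IPW estimator for a given propensity score
ipw_est <- function(y, a, e) mean(a * y / e) - mean((1 - a) * y / (1 - e))

c(dim = dim_est(y, a), ipw = ipw_est(y, a, e))
```

Wrapping the data generation in a function and calling it inside `replicate()` is one way to collect the estimates across repeated runs before taking their mean and sd.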

Question 2: Estimated vs. True Propensity Score

In our lecture, we briefly mentioned that using the estimated propensity score can actually be better than using the true propensity score. Let’s use a simulation study to see if this is in fact true. For Scenario 1 in Question 1, use the following setting:

Sample size: 50
Number of simulation runs: 1000

Consider two IPW estimators, one using the true propensity score and the other using the estimated propensity score. Compare the mean and sd of the two estimators over your simulation runs. If you doubt your conclusion, also try a much larger sample size, and see if the conclusion changes.
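
A single simulation run of this comparison might look like the sketch below. The data-generating model and the name `ipw_est` are assumptions made for illustration, and the propensity score is estimated with a plain logistic regression via `glm()`. Your own Scenario 1 generator would replace the first few lines.

```r
# Sketch only: the data-generating model below is an assumption.
set.seed(2)
n <- 50
x <- rnorm(n)
e_true <- plogis(0.5 * x)          # true propensity score
a <- rbinom(n, 1, e_true)
y <- a + x + rnorm(n)

# Estimated propensity score from a logistic regression of A on X
fit <- glm(a ~ x, family = binomial())
e_hat <- fitted(fit)

ipw_est <- function(y, a, e) mean(a * y / e) - mean((1 - a) * y / (1 - e))

c(true_ps = ipw_est(y, a, e_true), est_ps = ipw_est(y, a, e_hat))
```

A single run says little about which version is better. The comparison of interest is the mean and sd of each estimate across all runs.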

Question 3: Matching Methods

Load the AOD data from the twang package. This data contains 600 observations and 5 covariates, and was used in McCaffrey et al. (2013) to study the effect of substance abuse treatment. The treatment variable is treat and the outcome variable is suf12. Please note that the treat variable has three categories. For this question, we will restrict ourselves to the observations that received either community (traditional programs; consider this as the control) or metcbt5 (MET/CBT-5: evidence-based motivational enhancement therapy plus cognitive behavior therapy; consider this as the treatment).

In our lecture, we gave an example of matching using the propensity score. In this homework, let’s consider an additional idea, covariate matching: for each treated observation, we compute the Euclidean distance between its covariates and those of every control observation, and select the control that is closest. The covariates you should consider are all variables in the dataset other than treat and suf12.

Perform the following two matching methods using your own code. Please note that, for both methods, we are only interested in the average treatment effect on the treated (ATT), meaning that we want to estimate the treatment effect for those who received the metcbt5 treatment. Report the estimated ATT from each of the two matching methods.

Propensity Score Matching: Estimate the propensity score using logistic regression. Then, match each observation that received the metcbt5 treatment with an observation that received the community treatment using the propensity score. Based on your matching, estimate the ATT using the difference-in-means estimator.
Covariate Matching: For each observation that received the metcbt5 treatment, find a matched observation that received the community treatment using the Euclidean distance between the covariates. Based on your matching, estimate the ATT using the difference-in-means estimator.

For covariate matching, discuss which properties of the data could significantly affect the performance of the matching method. What can you do to mitigate these issues? You do not have to implement them.
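
The mechanics of the two matching methods can be sketched on toy data as below. Everything here is hypothetical: the data frame `toy`, its columns, and the helper `nearest_control` are invented for illustration and do not use the AOD data. The sketch also assumes one-to-one nearest-neighbor matching with replacement, which is only one of several reasonable choices.

```r
# Sketch on toy data; the data frame and column names are hypothetical.
set.seed(3)
n <- 200
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$a <- rbinom(n, 1, plogis(0.5 * toy$x1))
toy$y <- toy$a + toy$x1 + rnorm(n)

treated <- which(toy$a == 1)
control <- which(toy$a == 0)

# For each treated row, return the index of the closest control row
# (matching with replacement). Inputs are matrices, one row per unit.
nearest_control <- function(d_treated, d_control) {
  sapply(seq_len(nrow(d_treated)), function(i) {
    diffs <- sweep(d_control, 2, d_treated[i, ])  # subtract the treated row
    which.min(sqrt(rowSums(diffs^2)))             # smallest Euclidean distance
  })
}

# (a) Match on the estimated propensity score (a single column)
ps <- fitted(glm(a ~ x1 + x2, data = toy, family = binomial()))
m_ps <- nearest_control(cbind(ps[treated]), cbind(ps[control]))

# (b) Match on the raw covariates
xmat <- as.matrix(toy[, c("x1", "x2")])
m_cov <- nearest_control(xmat[treated, , drop = FALSE],
                         xmat[control, , drop = FALSE])

# Matched difference in means: treated outcomes minus matched control outcomes
att_ps  <- mean(toy$y[treated] - toy$y[control[m_ps]])
att_cov <- mean(toy$y[treated] - toy$y[control[m_cov]])
c(att_ps = att_ps, att_cov = att_cov)
```

Because the same helper handles both cases, the only difference between the two methods in this sketch is what is passed in as the matching variables.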

Question 4: Invariance of IPW Estimator

The IPW estimator (Horvitz–Thompson estimator) is not location invariant: if we add a constant \(c\) to every outcome \(Y_i\), the value of the IPW estimator changes in general. Show that the IPW estimator is not location invariant. To address this issue, a new estimator called the Hajek estimator was proposed:

\(\hat{\tau}_{\text{hajek}} = \frac{\sum_{i=1}^n \frac{A_iY_i}{\hat{e}_i}}{\sum_{i=1}^n \frac{A_i}{\hat{e}_i}} - \frac{\sum_{i=1}^n \frac{(1-A_i)Y_i}{1-\hat{e}_i}}{\sum_{i=1}^n \frac{(1-A_i)}{1-\hat{e}_i}}\)
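
For comparison, the Horvitz–Thompson form of the IPW estimator is commonly written as follows in the same notation. This is the standard textbook form and is stated here as an assumption; if the lecture used a different version, use that one instead. The subscript \(\text{ht}\) is a label introduced only for this display.

\(\hat{\tau}_{\text{ht}} = \frac{1}{n}\sum_{i=1}^n \frac{A_iY_i}{\hat{e}_i} - \frac{1}{n}\sum_{i=1}^n \frac{(1-A_i)Y_i}{1-\hat{e}_i}\)

The only difference between the two estimators is the normalization: the Horvitz–Thompson form divides each weighted sum by \(n\), while the Hajek form divides by the sum of the weights.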

Prove that the Hajek estimator is location invariant, meaning that adding a constant \(c\) to every outcome leaves the Hajek estimator unchanged. Calculate this estimator using your simulated data from Question 2, and report its mean and sd over your simulation runs.
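
Before writing the proof, a numerical sanity check can be useful. The sketch below uses toy data from an assumed data-generating model and the true propensity score. The names `ht_est`, `hajek_est`, and `shift` are hypothetical. It compares each estimator before and after adding a constant to every outcome. It does not replace the proof.

```r
# Sketch only: toy data with an assumed data-generating model.
set.seed(4)
n <- 50
x <- rnorm(n)
e <- plogis(0.5 * x)               # true propensity score
a <- rbinom(n, 1, e)
y <- a + x + rnorm(n)

# Horvitz-Thompson form: weighted sums divided by n
ht_est <- function(y, a, e) mean(a * y / e) - mean((1 - a) * y / (1 - e))

# Hajek form: weighted sums divided by the sum of the weights
hajek_est <- function(y, a, e) {
  sum(a * y / e) / sum(a / e) -
    sum((1 - a) * y / (1 - e)) / sum((1 - a) / (1 - e))
}

shift <- 10   # arbitrary constant c added to every outcome
ht_change    <- ht_est(y + shift, a, e) - ht_est(y, a, e)
hajek_change <- hajek_est(y + shift, a, e) - hajek_est(y, a, e)
c(ht_change = ht_change, hajek_change = hajek_change)
```

Up to floating-point error, the change in the Hajek estimate should be zero, while the change in the Horvitz–Thompson estimate generally is not.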