Students are encouraged to work together on homework. However, sharing, copying, or providing any part of a homework solution or code is an infraction of the University’s rules on Academic Integrity. Any violation will be punished as severely as possible. Final submissions must be uploaded to Gradescope. No email or hard copy will be accepted. For late submission policy and grading rubrics, please refer to the course website.
HWx_yourNetID.pdf. For example,
HW01_rqzhu.pdf. Please note that this must be a
.pdf file. .html format
cannot be accepted since they may not be readable in
Gradescope. All proofs must be typed in LaTeX format. Make all of your
R code chunks visible for grading..Rmd file
as a template, be sure to remove this instruction
section.R version \(\geq 4.0.0\).
This will ensure your random seed generation is the same as everyone
else. Please note that updating the R version may require
you to reinstall all of your packages. In the markdown file, set
seed properly so that the results can be reproduced.We learned a few estimators of the average treatment effect (ATE) in the lecture. In this question, we will compare their performance in a simulation study. Use the following data generating process. We simulate i.i.d. samples \((X, A, Y)\) with nonlinear effects. Let \[ X = (X_1, X_2, X_3, X_4, X_5), \] where \(X_1, X_2 \sim \mathcal{N}(0,1)\), \(X_3 \sim \mathrm{Unif}[-1,1]\), \(X_4 \sim \mathrm{Bernoulli}(0.4)\), and \(X_5 \sim \mathrm{Exp}(1)\). The treatment label \(A\) follows \(\mathrm{Bernoulli}(e(X))\), with \[ \text{logit}\{e(X)\} = 0.6 \, X_1 + 0.6 \, X_2^2 - 0.8 \cdot \mathbf{1}\{X_3>0\}, \]
The potential outcomes are generated with \[ Y(0) = 1 + 0.8X_1 + 0.5X_4 + 0.3\log(1+X_5) + \varepsilon\\ \]
with \(\varepsilon \sim \mathcal{N}(0,1)\). And \(Y(1) = Y(0) + \tau(X)\), with
\[ \tau(X) = 1 - 0.3X_2 + 0.4\,\mathbf{1}\{X_3>0\} + 0.2\sqrt{X_5}. \]
a). [5 pts] Approximate the true average treatment effect \(\tau = \mathbb{E}[\tau(X)]\) using a Monte Carlo approach with 10000 samples.
b). [15 pts] In practice, we observe \(Y = A\,Y(1) + (1-A)\,Y(0)\) instead of the potential outcomes. Simulate a dataset of size \(n=500\) from the above data generating process. Implement the IPW estimator
\[ \hat{\tau}_{\text{IPW}} = \frac{1}{n} \sum_{i=1}^n \left( \frac{A_i Y_i}{\hat{e}(X_i)} - \frac{(1-A_i) Y_i}{1-\hat{e}(X_i)} \right), \]
in which \(\hat{e}(X)\) is the estimated propensity score using
grf) using the
probability_forest() function with default settings (see documentation)Perform this simulation 1000 times and report the mean and standard deviation of both approaches. Which one performs better? Can you provide some explaination?
c). [15 pts] We also studied the doubly robust (DR) estimator which
could potentially improve the asymptotic efficiency. Let’s consider
using a linear regression model as the outcome regression model. Hence,
in this case, it is still biased. Use the previous simulated datasets,
and the two propensity score estimation methods (logistic regression and
grf), implement the two versions of the DR estimator.
Please note that you should implement the cross-fitting version
(two-fold) of this doubly robust estimator to archive its theoretical
properties.
In practice, it is impossible to know if the propensity score estimation is correctly specified, and we would not know if there are unobserved confounders. Consider this following simple regression model
\[ Y = \beta_0 + X_1 + A \cdot X_2 + \varepsilon, \]
where \(X_1\) and \(X_2\) are jointly normally distributed with mean 0 but unknown covariance structure, and \(\varepsilon \sim \mathcal{N}(0,1)\). In this question, we will use a logistic regression to estimate the propensity score, but assuming that you only get to observe \(X_1\), not \(X_2\). And we will use the IPW estimator to estimate the average treatment effect. You are asked to create two models for the propensity score model \(P(A=1|X_1, X_2)\) and the covariance structure of \(X_1\) and \(X_2\) such that
You need to properly explain your design of the model, and then demonstrate your idea with a simulation study with sample size 200 and 100 replications. Report the mean and standard deviation of the IPW estimator in both cases, and compare them with the true ATE (either simulated or calculated analytically).
Load the AOD data from the twang package.
This dataset contains 600 observations and 5 covariates and was used in
McCaffrey et al. (2013) to study the effect of different substance abuse
treatments. The treatment variable is treat, and the
outcome variable is suf12. Note that treat has
three categories. For this question, restrict your analysis to
observations that received either community (traditional
programs, consider this as the control) or metcbt5
(MET/CBT-5: evidence-based motivational enhancement therapy plus
cognitive behavior therapy) as the treatment label.
In our lecture, we discussed matching methods based on the propensity
score. In this problem, you will implement two matching methods:
propensity score matching and covariate matching, to estimate the
average treatment effect on the treated (ATT), i.e., the treatment
effect for those who received metcbt5. The covariates for
matching are all variables in the dataset except treat and
suf12. Perform the following two matching methods using
your own code and report the estimated ATT from each approach.
Propensity Score Matching: estimate the propensity score using a
binary logistic regression model where the response is
treat (coded as metcbt5
vs. community). Match each treated observation
(metcbt5) with the control observation
(community) having the closest propensity score. Based on
your matching, estimate the ATT using the difference in means of the
outcome (suf12) between treated and matched
controls.
Covariate Matching: For each treated observation, find the
control observation with the smallest Euclidean distance in covariate
space (using the same covariates as above). Based on your matching,
estimate the ATT using the difference in means of suf12
between treated and matched controls.
In the covariate matching method, what characteristics of the data could significantly affect the quality of the matches? What strategies could you apply to mitigate these issues? (You do not need to implement these improvements.)
The IPW estimator (Horvitz-Thompson Estimator) is not location invariant. If we add constant c to all observations, the IPW estimator will change. Show that the IPW estimator is not location invariant. To address this issue, a new estimator called the Hajek estimator was proposed:
\[ \hat{\tau}_{\text{hajek}} = \frac{\sum_{i=1}^n \frac{A_iY_i}{\hat{e}_i}}{\sum_{i=1}^n \frac{A_i}{\hat{e}_i}} - \frac{\sum_{i=1}^n \frac{(1-A_i)Y_i}{1-\hat{e}_i}}{\sum_{i=1}^n \frac{(1-A_i)}{1-\hat{e}_i}} \]
Proof that the Hajek estimator is location invariant, meaning that when adding a constant \(c\) to all observations, the Hajek estimator will not change. Calculate this estimator based on your data in Question 1, and just use the logistic regression version. Report the mean and sd of the estimator over your simulation runs.