This is a simulation study. You are required to design your own simulation data generator and then compare the performance of two estimators: the naive difference-in-means estimator and the inverse-propensity weighted (IPW) estimator. The two simulation scenarios should satisfy the following:
Briefly explain why the DID (difference-in-means) estimator is biased in your settings. For both scenarios, replicate the simulation 100 times and compare the mean and sd of the two estimators.
For Scenario 1, we assume \(Y(1)\sim N(0.5X+X^2,1)\), and \(Y(0)\sim N(0.5X,1)\), where \(X\sim N(0,1)\). Thus, the true ATE can be calculated as \(\tau = \mathbb{E}[Y(1)-Y(0)]=\mathbb{E}[X^2]=1\). We assign the treatment with a Bernoulli draw, \(P(A=1\mid X)=\frac{1}{1+e^X}\).
The DID estimator is biased in this setting because \(X\) is a confounder: units with higher \(X\) are more likely to receive treatment 0. The treated group therefore has systematically lower \(X\), so the \(0.5X\) term pulls the treated-group mean down and the control-group mean up. As a result, the DID estimator gives a result smaller than the true treatment effect.
The IPW estimator shows a smaller bias but a larger variance compared with the DID estimator.
n <- 200
tau <- 1 # true ATE
nsim <- 100
set.seed(11)
tauhat_did <- rep(NA, nsim)
tauhat_ipw <- rep(NA, nsim)
for (i in 1:nsim) {
  # generate potential outcomes
  X <- rnorm(n)
  Y1 <- rnorm(n, mean = 0.5 * X + X^2)
  Y0 <- rnorm(n, mean = 0.5 * X)
  # treatment label: P(A = 1 | X) = 1 / (1 + exp(X))
  A <- rbinom(n, 1, 1 / (1 + exp(X)))
  # observed outcomes
  Y <- A * Y1 + (1 - A) * Y0
  tauhat_did[i] <- mean(Y[A == 1]) - mean(Y[A == 0])
  # IPW weights: 1/e = 1 + exp(X), 1/(1 - e) = (1 + exp(X)) / exp(X)
  tauhat_ipw[i] <- mean(A * Y * (1 + exp(X)) - (1 - A) * Y * ((1 + exp(X)) / exp(X)))
}
par(mfrow=c(1,2))
hist(tauhat_did)
hist(tauhat_ipw)
mean(tauhat_did)
## [1] 0.5888208
sd(tauhat_did)
## [1] 0.1879875
mean(tauhat_ipw)
## [1] 1.015779
sd(tauhat_ipw)
## [1] 0.3716539
For Scenario 2, similarly, we assume \(Y(1)\sim N(0.5X+X^2,1)\), and \(Y(0)\sim N(0.5X,1)\), where \(X\in \{-1,1\}\) with equal probabilities. Thus, the true ATE can be calculated as \(\tau = \mathbb{E}[Y(1)-Y(0)]=\mathbb{E}[X^2]=1\). We assign the treatment with a Bernoulli draw, \(P(A=1\mid X)=\frac{e^X}{1+e^X}\).
The DID estimator is biased in this setting because units with higher \(X\) are more likely to receive treatment 1, so \(X\) confounds the comparison: the \(0.5X\) term inflates the treated-group mean and deflates the control-group mean. Thus, the DID estimator gives a result larger than the true treatment effect.
The IPW estimator shows a smaller bias and a comparable variance compared with the DID estimator.
n <- 200
tau <- 1 # true ATE
nsim <- 100
set.seed(11)
tauhat_did <- rep(NA, nsim)
tauhat_ipw <- rep(NA, nsim)
for (i in 1:nsim) {
  # generate potential outcomes (draw n covariates, not nsim)
  X <- sample(c(-1, 1), n, replace = TRUE)
  Y1 <- rnorm(n, mean = 0.5 * X + X^2)
  Y0 <- rnorm(n, mean = 0.5 * X)
  # treatment label: P(A = 1 | X) = exp(X) / (1 + exp(X))
  A <- rbinom(n, 1, exp(X) / (1 + exp(X)))
  # observed outcomes
  Y <- A * Y1 + (1 - A) * Y0
  tauhat_did[i] <- mean(Y[A == 1]) - mean(Y[A == 0])
  # IPW weights: 1/e = (1 + exp(X)) / exp(X), 1/(1 - e) = 1 + exp(X)
  tauhat_ipw[i] <- mean(A * Y * (1 + exp(X)) / exp(X) - (1 - A) * Y * (1 + exp(X)))
}
par(mfrow=c(1,2))
hist(tauhat_did)
hist(tauhat_ipw)
mean(tauhat_did)
## [1] 1.490535
sd(tauhat_did)
## [1] 0.1610187
mean(tauhat_ipw)
## [1] 1.025146
sd(tauhat_ipw)
## [1] 0.2078914
In our lecture, we briefly mentioned a property that using the estimated propensity score can actually be better than using its theoretical truth. Let’s use a simulation study to see if this is in fact true. For Scenario 1 in Question 1, use the following setting:
Consider two IPW estimators, one using the true propensity score and the other using the estimated propensity score. Compare the mean and sd of the two estimators over your simulation runs. If you doubt your conclusion, also try a much larger sample size and see if the conclusion changes.
n <- 50
tau <- 1 # true ATE
nsim <- 1000
set.seed(11)
tauhat_ipw <- rep(NA, nsim)
tauhat_ipw_est <- rep(NA, nsim)
for (i in 1:nsim) {
  # generate potential outcomes
  X <- rnorm(n)
  Y1 <- rnorm(n, mean = 0.5 * X + X^2)
  Y0 <- rnorm(n, mean = 0.5 * X)
  # treatment label
  A <- rbinom(n, 1, 1 / (1 + exp(X)))
  # observed outcomes
  Y <- A * Y1 + (1 - A) * Y0
  # IPW with the true propensity score
  tauhat_ipw[i] <- mean(A * Y * (1 + exp(X)) - (1 - A) * Y * ((1 + exp(X)) / exp(X)))
  # IPW with the propensity score estimated by logistic regression
  mod <- glm(A ~ X, family = "binomial")
  prob <- predict(mod, type = "response")
  tauhat_ipw_est[i] <- mean(A * Y * (1 / prob) - (1 - A) * Y * (1 / (1 - prob)))
}
par(mfrow=c(1,2))
hist(tauhat_ipw)
hist(tauhat_ipw_est)
mean(tauhat_ipw)
## [1] 0.9974602
sd(tauhat_ipw)
## [1] 0.7362435
mean(tauhat_ipw_est)
## [1] 0.9249151
sd(tauhat_ipw_est)
## [1] 0.6433434
What we can see here is that using the estimated propensity score gives a smaller variance than using the true propensity score, although it is slightly biased. If we increase the sample size to 500, this bias is reduced, and the estimated version still has a smaller sd.
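To support the last claim, here is a sketch that re-runs the same comparison with a larger sample size, \(n = 500\) (same data-generating process and estimators as the chunk above; output not shown here):

```r
# Re-run the true-vs-estimated propensity score comparison with n = 500
n <- 500
nsim <- 1000
set.seed(11)
tauhat_ipw <- rep(NA, nsim)
tauhat_ipw_est <- rep(NA, nsim)
for (i in 1:nsim) {
  X <- rnorm(n)
  Y1 <- rnorm(n, mean = 0.5 * X + X^2)
  Y0 <- rnorm(n, mean = 0.5 * X)
  A <- rbinom(n, 1, 1 / (1 + exp(X)))
  Y <- A * Y1 + (1 - A) * Y0
  # true propensity score
  tauhat_ipw[i] <- mean(A * Y * (1 + exp(X)) - (1 - A) * Y * ((1 + exp(X)) / exp(X)))
  # estimated propensity score
  mod <- glm(A ~ X, family = "binomial")
  prob <- predict(mod, type = "response")
  tauhat_ipw_est[i] <- mean(A * Y / prob - (1 - A) * Y / (1 - prob))
}
c(mean(tauhat_ipw), sd(tauhat_ipw))
c(mean(tauhat_ipw_est), sd(tauhat_ipw_est))
```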
Load the AOD data from the twang package. This data contains 600 observations and 5 covariates. The data was used in McCaffrey et al. (2013) and concerns the treatment effect of substance abuse treatment. The treatment variable is treat and the outcome variable is suf12. Please note that the treat variable has three categories. For this question, we will restrict ourselves to the data that received community (traditional programs; consider this as the control) or metcbt5 (MET/CBT-5: evidence-based motivational enhancement therapy plus cognitive behavior therapy) as the treatment label. In our lecture, we gave an example of matching using the propensity score. In this homework, let’s consider an additional idea, covariate matching: we consider the Euclidean distance between the covariates of the treated and control groups, and select the control that is closest. The covariates you should consider are anything besides treat and suf12 in the dataset. Perform the following two matching methods using your own code. Please note that, for both methods, we are only interested in the average treatment effect of the treated (ATT), meaning that we want to estimate the treatment effect of those who received the metcbt5 treatment. Report the results of the two matching methods in terms of the estimated ATE.
1. For each observation that received the metcbt5 treatment, find a matched observation that received the community treatment using the propensity score. Based on your matching, estimate the ATE using the difference-in-means estimator.
2. For each observation that received the metcbt5 treatment, find a matched observation that received community using the Euclidean distance of the covariates. Based on your matching, estimate the ATE using the difference-in-means estimator.
Discuss, for the covariate matching, what properties of the data could significantly affect the performance of the matching method. What can you do to mitigate these issues? You do not have to implement them.
library(twang)
## To reproduce results from prior versions of the twang package, please see the version="legacy" option described in the documentation.
data(AOD)
head(AOD)
# subset data to the community and metcbt5 groups
df <- AOD[AOD$treat == "community" | AOD$treat == "metcbt5", ]
# logistic regression for propensity score
mod <- glm(treat ~ . - suf12, data = df, family = binomial)
summary(mod)
##
## Call:
## glm(formula = treat ~ . - suf12, family = binomial, data = df)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.41532 -1.16973 -0.00695 1.17409 1.39759
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.04850 0.11115 -0.436 0.663
## illact -0.08160 0.09732 -0.839 0.402
## crimjust 0.09270 0.09701 0.956 0.339
## subprob 0.09144 0.10175 0.899 0.369
## subdep 0.01317 0.09506 0.139 0.890
## white 0.30582 0.26368 1.160 0.246
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 554.52 on 399 degrees of freedom
## Residual deviance: 550.83 on 394 degrees of freedom
## AIC: 562.83
##
## Number of Fisher Scoring iterations: 4
df$propensity_score <- predict(mod, type = "response")
df0 <- df[df$treat == "community", ]
df1 <- df[df$treat == "metcbt5", ]
matched_idx <- rep(NA, nrow(df1))
# getting index of matched control samples (matching with replacement)
for (i in 1:nrow(df1)) {
  diff <- abs(df1$propensity_score[i] - df0$propensity_score)
  matched_idx[i] <- which.min(diff)
}
# calculate the ATT via difference in means
mean(df1$suf12 - df0$suf12[matched_idx])
## [1] 0.0931482
The estimated treatment effect of the treated is 0.0931.
x0 <- data.matrix(AOD[AOD$treat == "community", -c(1, 2)])
x1 <- data.matrix(AOD[AOD$treat == "metcbt5", -c(1, 2)])
matched_idx <- rep(NA, nrow(x1))
for (i in 1:nrow(x1)) {
  # squared Euclidean distance from treated observation i to every control
  diff <- sweep(x0, 2, STATS = x1[i, ], FUN = "-")
  dist <- rowSums(diff^2)
  matched_idx[i] <- which.min(dist)
}
# calculate the ATT via difference in means
mean(df1$suf12 - df0$suf12[matched_idx])
## [1] 0.1775917
The two methods perform similarly and both give a slightly positive ATT. The covariate matching method is sensitive to the scales of the covariates: if the scales differ, the distance will be dominated by the covariate with the largest scale. To mitigate this issue, we can standardize the covariates before matching.
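Although the question does not require implementing the mitigation, a minimal sketch of the standardization idea follows, reusing the x0, x1, df0, and df1 objects defined above. Each covariate is centered and scaled using the pooled sample before computing distances:

```r
# Standardize covariates (mean 0, sd 1) over the pooled sample so that no
# covariate dominates the Euclidean distance simply because of its scale
xall <- rbind(x0, x1)
ctr <- colMeans(xall)
scl <- apply(xall, 2, sd)
x0s <- scale(x0, center = ctr, scale = scl)
x1s <- scale(x1, center = ctr, scale = scl)
matched_idx_s <- rep(NA, nrow(x1s))
for (i in 1:nrow(x1s)) {
  diff <- sweep(x0s, 2, STATS = x1s[i, ], FUN = "-")
  matched_idx_s[i] <- which.min(rowSums(diff^2))
}
# ATT after matching on standardized covariates
mean(df1$suf12 - df0$suf12[matched_idx_s])
```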
The IPW estimator (Horvitz–Thompson estimator) is not location invariant: if we add a constant \(c\) to all observations, the IPW estimator will change. Show that the IPW estimator is not location invariant. To address this issue, a new estimator, called the Hajek estimator, was proposed:
\[ \hat{\tau}_{\text{hajek}} = \frac{\sum_{i=1}^n \frac{A_iY_i}{\hat{e}_i}}{\sum_{i=1}^n \frac{A_i}{\hat{e}_i}} - \frac{\sum_{i=1}^n \frac{(1-A_i)Y_i}{1-\hat{e}_i}}{\sum_{i=1}^n \frac{(1-A_i)}{1-\hat{e}_i}} \] Prove that the Hajek estimator is location invariant, meaning that when adding a constant \(c\) to all observations, the Hajek estimator will not change. Calculate this estimator based on your data in Question 2. Report the mean and sd of the estimator over your simulation runs.
For the IPW estimator, we have \[ \hat \tau_0 = \frac{1}{n}\sum_{i=1}^n\left[\frac{A_iY_i}{\hat e(X_i)} - \frac{(1 - A_i)Y_i}{1 - \hat e(X_i)}\right] \]
After adding a constant \(c\) to all values of \(Y_i\), making it \(Y_i' = Y_i + c\), the modified IPW estimator becomes: \[\begin{align*} \hat \tau_1 & = \frac{1}{n}\sum_{i=1}^n\left[\frac{A_i(Y_i+c)}{\hat e(X_i)} - \frac{(Y_i+c)(1 - A_i)}{1 - \hat e(X_i)}\right] \\ & = \frac{1}{n}\sum_{i=1}^n \frac{A_i(Y_i + c) (1-\hat e(X_i)) - (1-A_i)(Y_i+c)\hat e(X_i) }{\hat e(X_i) (1-\hat e(X_i))} \end{align*}\]
The difference between the original and modified IPW estimators, which shows the effect of adding \(c\), is: \[ \hat \tau_0-\hat \tau_1= \frac{1}{n}\sum_{i=1}^n \frac{c\,(\hat e(X_i)-A_i)}{\hat e(X_i) (1-\hat e(X_i))} \neq 0 \]
This difference demonstrates that the IPW estimator is not location invariant: adding a constant \(c\) to all outcome values \(Y_i\) alters the estimator’s value through a term that depends on the treatment indicator \(A_i\) and the propensity score \(\hat e(X_i)\).
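As a quick numerical sanity check of this derivation, we can verify on a toy simulated dataset (the data-generating process below is chosen only for illustration) that shifting every outcome by \(c\) changes the IPW estimate by exactly the term derived above:

```r
# Toy check: shifting all outcomes by c changes the IPW estimate by
# (1/n) * sum(c * (A - e) / (e * (1 - e))), matching the derivation
set.seed(1)
n <- 1000
X <- rnorm(n)
e <- 1 / (1 + exp(X))          # true propensity score
A <- rbinom(n, 1, e)
Y <- rnorm(n, mean = 0.5 * X + A * X^2)
ipw <- function(y) mean(A * y / e - (1 - A) * y / (1 - e))
c_shift <- 10
shift_observed <- ipw(Y + c_shift) - ipw(Y)
shift_formula <- mean(c_shift * (A - e) / (e * (1 - e)))
all.equal(shift_observed, shift_formula)   # should be TRUE
```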
For the Hajek estimator, \[ \hat{\tau}_0 = \frac{\sum_{i=1}^n \frac{A_iY_i}{\hat{e}_i}}{\sum_{i=1}^n \frac{A_i}{\hat{e}_i}} - \frac{\sum_{i=1}^n \frac{(1-A_i)Y_i}{1-\hat{e}_i}}{\sum_{i=1}^n \frac{(1-A_i)}{1-\hat{e}_i}} \]
After adding a constant \(c\) to all values of \(Y_i\), making it \(Y_i' = Y_i + c\), the modified Hajek estimator becomes: \[\begin{align*} \hat \tau_1 & = \frac{\sum_{i=1}^n \frac{A_i(Y_i+c)}{\hat{e}_i}}{\sum_{i=1}^n \frac{A_i}{\hat{e}_i}} - \frac{\sum_{i=1}^n \frac{(1-A_i)(Y_i+c)}{1-\hat{e}_i}}{\sum_{i=1}^n \frac{(1-A_i)}{1-\hat{e}_i}} \\ & = \frac{\sum_{i=1}^n \frac{A_i(Y_i+c)}{\hat{e}_i}\sum_{i=1}^n\frac{(1-A_i)}{1-\hat{e}_i} - \sum_{i=1}^n \frac{(1-A_i)(Y_i+c)}{1-\hat{e}_i}\sum_{i=1}^n\frac{A_i}{\hat{e}_i} }{(\sum_{i=1}^n \frac{A_i}{\hat{e}_i})(\sum_{i=1}^n \frac{(1-A_i)}{1-\hat{e}_i})} \\ & = \frac{\sum_{i=1}^n \frac{A_iY_i}{\hat{e}_i}\sum_{i=1}^n\frac{(1-A_i)}{1-\hat{e}_i} - \sum_{i=1}^n \frac{(1-A_i)Y_i}{1-\hat{e}_i}\sum_{i=1}^n\frac{A_i}{\hat{e}_i} }{(\sum_{i=1}^n \frac{A_i}{\hat{e}_i})(\sum_{i=1}^n \frac{(1-A_i)}{1-\hat{e}_i})} \qquad (=\hat \tau_0)\\ \qquad & +\frac{c\sum_{i=1}^n \frac{A_i}{\hat{e}_i}\sum_{i=1}^n\frac{(1-A_i)}{1-\hat{e}_i}-c\sum_{i=1}^n \frac{(1-A_i)}{1-\hat{e}_i}\sum_{i=1}^n\frac{A_i}{\hat{e}_i}}{(\sum_{i=1}^n \frac{A_i}{\hat{e}_i})(\sum_{i=1}^n \frac{(1-A_i)}{1-\hat{e}_i})} \qquad (=0)\\ & = \hat \tau_0 \end{align*}\]
Therefore, the Hajek estimator is location invariant.
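The question also asks to compute the Hajek estimator on the data from Question 2. A sketch follows, re-generating the same Scenario 1 data (\(n = 50\), 1000 replications) with the estimated propensity score and reporting the mean and sd over the simulation runs:

```r
# Hajek estimator over the same simulation setting as Question 2
n <- 50
nsim <- 1000
set.seed(11)
tauhat_hajek <- rep(NA, nsim)
for (i in 1:nsim) {
  X <- rnorm(n)
  Y1 <- rnorm(n, mean = 0.5 * X + X^2)
  Y0 <- rnorm(n, mean = 0.5 * X)
  A <- rbinom(n, 1, 1 / (1 + exp(X)))
  Y <- A * Y1 + (1 - A) * Y0
  # estimated propensity score via logistic regression
  mod <- glm(A ~ X, family = "binomial")
  ehat <- predict(mod, type = "response")
  # Hajek: normalize each arm's IPW weights so they sum to one
  w1 <- A / ehat
  w0 <- (1 - A) / (1 - ehat)
  tauhat_hajek[i] <- sum(w1 * Y) / sum(w1) - sum(w0 * Y) / sum(w0)
}
c(mean(tauhat_hajek), sd(tauhat_hajek))
```

Because the weights in each arm are normalized to sum to one, adding a constant \(c\) to all outcomes shifts both terms by exactly \(c\), leaving the estimate unchanged, consistent with the proof above.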