Stat 546 Homework 4

Instruction
Question 1: Estimators of ATE
Question 2: Effect of Unobserved Confounders
Question 3: Matching Methods
Question 4: Invariance of IPW Estimator

Instruction

Students are encouraged to work together on homework. However, sharing, copying, or providing any part of a homework solution or code is an infraction of the University’s rules on Academic Integrity. Any violation will be punished as severely as possible. Final submissions must be uploaded to Gradescope. No email or hard copy will be accepted. For late submission policy and grading rubrics, please refer to the course website.

You are required to submit the rendered file HWx_yourNetID.pdf. For example, HW01_rqzhu.pdf. Please note that this must be a .pdf file. .html format cannot be accepted since they may not be readable in Gradescope. All proofs must be typed in LaTeX format. Make all of your R code chunks visible for grading.
Include your Name and NetID in the report.
If you use this file or the example homework .Rmd file as a template, be sure to remove this instruction section.
On random seed and reproducibility: You should use R version \(\geq 4.0.0\). This will ensure your random seed generation is the same as everyone else. Please note that updating the R version may require you to reinstall all of your packages. In the markdown file, set seed properly so that the results can be reproduced.
For some questions, there will be restrictions on what packages/functions you can use. Please read the requirements carefully. As long as the question does not specify such restrictions, you can use anything.
Using AI tools: you should provide a brief statement on the involvement of AI tools on your homework.

Question 1: Estimators of ATE

We learned a few estimators of the average treatment effect (ATE) in the lecture. In this question, we will compare their performance in a simulation study. Use the following data generating process. We simulate i.i.d. samples \((X, A, Y)\) with nonlinear effects. Let \[ X = (X_1, X_2, X_3, X_4, X_5), \] where \(X_1, X_2 \sim \mathcal{N}(0,1)\), \(X_3 \sim \mathrm{Unif}[-1,1]\), \(X_4 \sim \mathrm{Bernoulli}(0.4)\), and \(X_5 \sim \mathrm{Exp}(1)\). The treatment label \(A\) follows \(\mathrm{Bernoulli}(e(X))\), with \[ \text{logit}\{e(X)\} = 0.6 \, X_1 + 0.6 \, X_2^2 - 0.8 \cdot \mathbf{1}\{X_3>0\}, \]

The potential outcomes are generated with \[ Y(0) = 1 + 0.8X_1 + 0.5X_4 + 0.3\log(1+X_5) + \varepsilon\\ \]

with \(\varepsilon \sim \mathcal{N}(0,1)\). And \(Y(1) = Y(0) + \tau(X)\), with

\[ \tau(X) = 1 - 0.3X_2 + 0.4\,\mathbf{1}\{X_3>0\} + 0.2\sqrt{X_5}. \]

a). [5 pts] Approximate the true average treatment effect \(\tau = \mathbb{E}[\tau(X)]\) using a Monte Carlo approach with 10000 samples.

Answer:

set.seed(546)
n <- 10000

# covariates
X <- data.frame(
  X1 = rnorm(n),
  X2 = rnorm(n),
  X3 = runif(n, -1, 1),
  X4 = rbinom(n, 1, 0.4),
  X5 = rexp(n, 1)
)
# true ATE: tau(X) = 1 - 0.3 X2 + 0.4 1{X3>0} + 0.2 sqrt(X5)
tau_vals <- 1 - 0.3 * X$X2 + 0.4 * (X$X3 > 0) + 0.2 * sqrt(X$X5)
tau <- mean(tau_vals)
tau

## [1] 1.3845

b). [15 pts] In practice, we observe \(Y = A\,Y(1) + (1-A)\,Y(0)\) instead of the potential outcomes. Simulate a dataset of size \(n=500\) from the above data generating process. Implement the IPW estimator

\[ \hat{\tau}_{\text{IPW}} = \frac{1}{n} \sum_{i=1}^n \left( \frac{A_i Y_i}{\hat{e}(X_i)} - \frac{(1-A_i) Y_i}{1-\hat{e}(X_i)} \right), \]

in which \(\hat{e}(X)\) is the estimated propensity score using

A logistic regression with main effects of \(X\)
A generalized random forest (grf) using the probability_forest() function with default settings (see documentation)

Perform this simulation 1000 times and report the mean and standard deviation of both approaches. Which one performs better? Can you provide some explaination? Investigate why the grf estimator performs worse: biasedness in the propensity score estimation. Plot the estimated propensity scores against the true propensity scores for one simulated dataset.

Answer:

set.seed(546)
library(grf)

n <- 500          # sample size per dataset
n_sim <- 1000     # number of simulations

logistic_est  <- numeric(n_sim)
grf_est       <- numeric(n_sim)

logistic_ex_err <- numeric(n_sim)
grf_ex_err      <- numeric(n_sim)

for (sim in 1:n_sim) {
  
  # covariates
  X <- data.frame(
    X1 = rnorm(n),
    X2 = rnorm(n),
    X3 = runif(n, -1, 1),
    X4 = rbinom(n, 1, 0.4),
    X5 = rexp(n, 1)
  )
  
  # true propensity score: logit{e(X)} = 0.6 X1 + 0.6 X2^2 - 0.8 1{X3>0}
  logit_e <- 0.6 * X$X1 + 0.6 * X$X2^2 - 0.8 * (X$X3 > 0)
  ex <- exp(logit_e) / (1 + exp(logit_e))
  
  # treatment assignment
  A <- rbinom(n, 1, ex)
  
  # potential outcomes
  Y0 <- 1 + 0.8 * X$X1 + 0.5 * X$X4 + 0.3 * log(1 + X$X5) + rnorm(n)
  tau_X <- 1 - 0.3 * X$X2 + 0.4 * (X$X3 > 0) + 0.2 * sqrt(X$X5)
  Y1    <- Y0 + tau_X
  
  # observed outcome
  Y <- A * Y1 + (1 - A) * Y0
  
  # estimate propensity score using logistic regression
  ps_logistic <- glm(A ~ X1 + X2 + X3 + X4 + X5, data = X, family = binomial)
  e_hat_logistic <- predict(ps_logistic, type = "response")
  logistic_ex_err[sim] <- mean((e_hat_logistic - ex)^2)
  
  # IPW estimator with logistic regression
  ipw_logistic <- mean(A * Y / e_hat_logistic - (1 - A) * Y / (1 - e_hat_logistic))
  logistic_est[sim] <- ipw_logistic
  
  # estimate propensity score using GRF
  ps_grf <- probability_forest(X, as.factor(A))
  e_hat_grf <- predict(ps_grf)$predictions
  grf_ex_err[sim] <- mean((e_hat_grf[, 2] - ex)^2)
  
  # IPW estimator with GRF
  ipw_grf <- mean(A * Y / e_hat_grf[, 2] - (1 - A) * Y / e_hat_grf[, 1])
  grf_est[sim]  <- ipw_grf
}

boxplot(logistic_est, grf_est,
        names = c("Logistic", "GRF"),
        main  = "IPW Estimators Comparison",
        ylab  = "Estimated ATE")

# add true ATE line from part (a)
abline(h = tau, col = "red", lty = 2)

# mean and sd
mean(logistic_est); sd(logistic_est)

## [1] 1.37556

## [1] 0.1081751

mean(grf_est);      sd(grf_est)

## [1] 1.504544

## [1] 0.1014047

c). [15 pts] We also studied the doubly robust (DR) estimator which could potentially improve the asymptotic efficiency. Let’s consider using a linear regression model as the outcome regression model. Hence, in this case, it is still biased. Use the previous simulated datasets, and the two propensity score estimation methods (logistic regression and grf), implement the two versions of the DR estimator. Please note that you should implement the cross-fitting version (two-fold) of this doubly robust estimator to archive its theoretical properties.

Which one do you expect to perform better?
Report the mean and standard deviation of both approaches over 100 simulations.
Does your result matches your expectation? Which method was able to improve from the previous IPW version? Please provide some explanation. Be aware that this is a simulation study, and the results may vary.

Answer:

## (c) – DR with cross-fitting helper
dr_cross <- function(a, y, x, idx, method = c("logistic", "grf")) {
  method <- match.arg(method)
  
  alldata     <- data.frame(a = a, y = y, x)
  data_plugin <- alldata[-idx, ]
  data_est    <- alldata[idx, ]
  
  # outcome regression models
  m1_fit <- lm(y ~ ., data = data_plugin[data_plugin$a == 1, -1])  # fit on treated
  m0_fit <- lm(y ~ ., data = data_plugin[data_plugin$a == 0, -1])  # fit on control
  
  m1_hat <- predict(m1_fit, newdata = data_est)
  m0_hat <- predict(m0_fit, newdata = data_est)
  
  # propensity score model
  if (method == "logistic") {
    ps_model <- glm(a ~ ., data = data_plugin[, -2], family = binomial)
    e_hat <- predict(ps_model, newdata = data_est, type = "response")
  } else if (method == "grf") {
    ps_model <- probability_forest(data_plugin[, -c(1, 2)], as.factor(data_plugin$a))
    e_hat <- predict(ps_model, newdata = data_est[, -c(1, 2)])$predictions[, 2]
  }
  
  # DR estimator on this fold
  dr_crossfit <- mean(
    m1_hat - m0_hat +
      data_est$a * (data_est$y - m1_hat) / e_hat -
      (1 - data_est$a) * (data_est$y - m0_hat) / (1 - e_hat)
  )
  
  return(dr_crossfit)
}

## (c) – DR simulation
set.seed(546)
library(grf)

n <- 500
n_sim <- 100

dr_logistic_est <- numeric(n_sim)
dr_grf_est      <- numeric(n_sim)

for (sim in 1:n_sim) {
  
  # covariates
  X <- data.frame(
    X1 = rnorm(n),
    X2 = rnorm(n),
    X3 = runif(n, -1, 1),
    X4 = rbinom(n, 1, 0.4),
    X5 = rexp(n, 1)
  )
  
  # true propensity score
  logit_e <- 0.6 * X$X1 + 0.6 * X$X2^2 - 0.8 * (X$X3 > 0)
  ex <- exp(logit_e) / (1 + exp(logit_e))
  
  # treatment assignment
  A <- rbinom(n, 1, ex)
  
  # potential outcomes
  Y0 <- 1 + 0.8 * X$X1 + 0.5 * X$X4 + 0.3 * log(1 + X$X5) + rnorm(n)
  tau_X <- 1 - 0.3 * X$X2 + 0.4 * (X$X3 > 0) + 0.2 * sqrt(X$X5)
  Y1 <- Y0 + tau_X
  
  # observed outcome
  Y <- A * Y1 + (1 - A) * Y0
  
  # cross-fitting indices
  idx1 <- sample(1:n, n/2)
  idx2 <- setdiff(1:n, idx1)
  
  # DR estimator with logistic regression
  dr_logistic_1 <- dr_cross(A, Y, X, idx1, method = "logistic")
  dr_logistic_2 <- dr_cross(A, Y, X, idx2, method = "logistic")
  dr_logistic_est[sim] <- (dr_logistic_1 + dr_logistic_2) / 2
  
  # DR estimator with GRF
  dr_grf_1 <- dr_cross(A, Y, X, idx1, method = "grf")
  dr_grf_2 <- dr_cross(A, Y, X, idx2, method = "grf")
  dr_grf_est[sim] <- (dr_grf_1 + dr_grf_2) / 2
}

boxplot(dr_logistic_est, dr_grf_est,
        names = c("Logistic", "GRF"),
        main  = "Doubly Robust Estimators Comparison",
        ylab  = "Estimated ATE")

# add true ATE line
abline(h = tau, col = "red", lty = 2)

# mean and sd
mean(dr_logistic_est); sd(dr_logistic_est)

## [1] 1.369509

## [1] 0.1156536

mean(dr_grf_est);      sd(dr_grf_est)

## [1] 1.376197

## [1] 0.104918

The logistic regression based DR estimator performs better than the previous IPW version, as it reduces bias by incorporating outcome regression information. The variance also reduced slightly, but not significant. This may or may not be caused by randomness in the simulation. The GRF based DR estimator do not show significant improvement since the regression model is also biased slightly, and the random forests estimator also suffers from bias slightly as shown in part (b).

Question 2: Effect of Unobserved Confounders

In practice, it is impossible to know if the propensity score estimation is correctly specified, and we would not know if there are unobserved confounders. Consider this following simple regression model

\[ Y = \beta_0 + X_1 + A \cdot X_2 + \varepsilon, \]

where \(X_1\) and \(X_2\) are jointly normally distributed with mean 0 but unknown covariance structure, and \(\varepsilon \sim \mathcal{N}(0,1)\). In this question, we will use a logistic regression to estimate the propensity score, but assuming that you only get to observe \(X_1\), not \(X_2\). And we will use the IPW estimator to estimate the average treatment effect. You are asked to create two models for the propensity score model \(P(A=1|X_1, X_2)\) and the covariance structure of \(X_1\) and \(X_2\) such that

The IPW estimator for ATE will be systematically biased upward.
The IPW estimator for ATE will be systematically biased downward.

You need to properly explain your design of the model, and then demonstrate your idea with a simulation study with sample size 200 and 100 replications. Report the mean and standard deviation of the IPW estimator in both cases, and compare them with the true ATE (either simulated or calculated analytically).

Answer:

The essential idea is to create correlation between \(X_1\) and \(X_2\), so that when we only observe \(X_1\), the estimated propensity score will be confounded with the effect of \(X_2\) on treatment assignment.

n     <- 200
n_sim <- 100

# setting 1: upward bias
set.seed(546)
ipw_est <- numeric(n_sim)
for (sim in 1:n_sim) {
# generate correlated X1 and X2
Sigma <- matrix(c(1, 0.8, 0.8, 1), nrow = 2)
X     <- MASS::mvrnorm(n, mu = c(0, 0), Sigma = Sigma)
X1    <- X[, 1]
X2    <- X[, 2]

# treatment assignment with positive dependence on X2
logit_e <- 0.8 * X2
ex      <- plogis(logit_e)
A       <- rbinom(n, 1, ex)

# outcome: Y = 2 + X1 + A * X2 + eps
Y <- 2 + X1 + A * X2 + rnorm(n)

# estimate propensity score using only X1
ps_model <- glm(A ~ X1, family = binomial)
e_hat <- predict(ps_model, type = "response")

# IPW estimator
ipw_est[sim] <- mean(A * Y / e_hat - (1 - A) * Y / (1 - e_hat))}

# results (upward bias relative to true ATE = 0)
mean(ipw_est)

## [1] 0.1334236

sd(ipw_est)

## [1] 0.1642203

# correlation between PS error and X2 in the last replicate
cor(e_hat - ex, X2)  # should be negative

## [1] -0.5166494

Here are some explanations

In this simulation, since \(X_2\) is marginally follows standard normal distribution, the true ATE is 0. However, since the treatment effect term \(\tau\) is positively associated with \(X_2\), it means that when \(X_2\) is large, the treatment effect is large.
On the other hand, when we estimate the propensity score using only \(X_1\), due to the positive correlation between \(X_1\) and \(X_2\), the estimated propensity score will be larger when \(X_2\) is large. However, this is not the true model, and we know that the full effect of \(X_2\) on treatment assignment is not accounted for. In other words, when \(X_2\) is large, the estimated propensity score will not be large enough. Hence the estimated propensity score will be much smaller than the true propensity score.
This creates a negative correlation between the estimation error (\(\hat e - e\)) of the propensity score and \(X_2\).
In the IPW estimator, when the estimated propensity score is smaller than the true propensity score, the weight for treated units will be larger than it should be. Since the treatment effect is positively associated with \(X_2\), this means that units with large treatment effects will be given larger weights, leading to an upward bias in the IPW estimator.

To reverse the bias direction, we can set the correlation between \(X_1\) and \(X_2\) to be negative, and also reverse the effect of \(X_2\) on treatment assignment. In this case, the estimation error of the propensity score will be positively correlated with \(X_2\), leading to a downward bias in the IPW estimator.

n     <- 200
n_sim <- 100

# setting 2: downward bias
set.seed(546)
ipw_est <- numeric(n_sim)
for (sim in 1:n_sim) {
# generate correlated X1 and X2
Sigma <- matrix(c(1, -0.8, -0.8, 1), nrow = 2)
X     <- MASS::mvrnorm(n, mu = c(0, 0), Sigma = Sigma)
X1    <- X[, 1]
X2    <- X[, 2]

# treatment assignment with negative dependence on X2
logit_e <- -0.8 * X2
ex      <- plogis(logit_e)
A       <- rbinom(n, 1, ex)

# outcome
Y <- 2 + X1 + A * X2 + rnorm(n)

# estimate propensity score using only X1
ps_model <- glm(A ~ X1, family = binomial)
e_hat <- predict(ps_model, type = "response")

# IPW estimator
ipw_est[sim] <- mean(A * Y / e_hat - (1 - A) * Y / (1 - e_hat))}

# results (downward bias relative to true ATE = 0)
mean(ipw_est)

## [1] -0.1267328

sd(ipw_est)

## [1] 0.1765004

# correlation between PS error and X2 in the last replicate
cor(e_hat - ex, X2)  # should be positive

## [1] 0.335788

Question 3: Matching Methods

Load the AOD data from the twang package. This dataset contains 600 observations and 5 covariates and was used in McCaffrey et al. (2013) to study the effect of different substance abuse treatments. The treatment variable is treat, and the outcome variable is suf12. Note that treat has three categories. For this question, restrict your analysis to observations that received either community (traditional programs, consider this as the control) or metcbt5 (MET/CBT-5: evidence-based motivational enhancement therapy plus cognitive behavior therapy) as the treatment label.

In our lecture, we discussed matching methods based on the propensity score. In this problem, you will implement two matching methods: propensity score matching and covariate matching, to estimate the average treatment effect on the treated (ATT), i.e., the treatment effect for those who received metcbt5. The covariates for matching are all variables in the dataset except treat and suf12. Perform the following two matching methods using your own code and report the estimated ATT from each approach.

Propensity Score Matching: estimate the propensity score using a binary logistic regression model where the response is treat (coded as metcbt5 vs. community). Match each treated observation (metcbt5) with the control observation (community) having the closest propensity score. Based on your matching, estimate the ATT using the difference in means of the outcome (suf12) between treated and matched controls.
Covariate Matching: For each treated observation, find the control observation with the smallest Euclidean distance in covariate space (using the same covariates as above). Based on your matching, estimate the ATT using the difference in means of suf12 between treated and matched controls.

In the covariate matching method, what characteristics of the data could significantly affect the quality of the matches? What strategies could you apply to mitigate these issues? (You do not need to implement these improvements.)

Answer:

  library(twang)
  data(AOD)
  head(AOD)

    # subest data
  df = AOD[c(AOD$treat=='community'|AOD$treat=='metcbt5'),]

  # logistic regression for propensity score
  mod = glm(treat~.-suf12,data=df,family=binomial)
  summary(mod)

## 
## Call:
## glm(formula = treat ~ . - suf12, family = binomial, data = df)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.04850    0.11115  -0.436    0.663
## illact      -0.08160    0.09732  -0.839    0.402
## crimjust     0.09270    0.09701   0.956    0.339
## subprob      0.09144    0.10175   0.899    0.369
## subdep       0.01317    0.09506   0.139    0.890
## white        0.30582    0.26368   1.160    0.246
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 554.52  on 399  degrees of freedom
## Residual deviance: 550.83  on 394  degrees of freedom
## AIC: 562.83
## 
## Number of Fisher Scoring iterations: 4

  df$propensity_score = predict(mod, type = "response")
  
  df0 = df[df$treat=='community',]
  df1 = df[df$treat=='metcbt5',]
  
  # Propensity Score Matching
  matched_idx = rep(NA, nrow(df1))
  for (i in 1:nrow(df1)){
    ps_diff = abs(df0$propensity_score - df1$propensity_score[i])
    ind = which.min(ps_diff)
    matched_idx[i] = ind
  }
  
  # calculate the ATE
  mean(df1$suf12 - df0$suf12[matched_idx])

## [1] 0.0931482

  # Covariate Matching
  x0 = data.matrix(AOD[c(AOD$treat=='community'), -c(1,2)])
  x1 = data.matrix(AOD[c(AOD$treat=='metcbt5'), -c(1,2)])

  matched_idx = rep(NA, nrow(x1))
  
  for (i in 1:nrow(x1)){
    
    # calculate the distance
    diff = sweep(x0, 2, STATS = x1[i,], FUN = "-")
    dist = rowSums(diff^2)
    
    ind = which.min(dist)
    matched_idx[i] = ind
  }

  # calculate the ATE
  mean(df1$suf12 - df0$suf12[matched_idx])

## [1] 0.1775917

Question 4: Invariance of IPW Estimator

The IPW estimator (Horvitz-Thompson Estimator) is not location invariant. If we add constant c to all observations, the IPW estimator will change. Show that the IPW estimator is not location invariant. To address this issue, a new estimator called the Hajek estimator was proposed:

\[ \hat{\tau}_{\text{hajek}} = \frac{\sum_{i=1}^n \frac{A_iY_i}{\hat{e}_i}}{\sum_{i=1}^n \frac{A_i}{\hat{e}_i}} - \frac{\sum_{i=1}^n \frac{(1-A_i)Y_i}{1-\hat{e}_i}}{\sum_{i=1}^n \frac{(1-A_i)}{1-\hat{e}_i}} \]

Proof that the Hajek estimator is location invariant, meaning that when adding a constant \(c\) to all observations, the Hajek estimator will not change. Calculate this estimator based on your data in Question 1, and just use the logistic regression version. Report the mean and sd of the estimator over your simulation runs.

Answer:

Let \[ \hat{\tau}_{\text{H}} \;=\; \frac{\sum_{i=1}^n \frac{A_i Y_i}{\hat e_i}}{\sum_{i=1}^n \frac{A_i}{\hat e_i}} \;-\; \frac{\sum_{i=1}^n \frac{(1-A_i) Y_i}{1-\hat e_i}}{\sum_{i=1}^n \frac{(1-A_i)}{1-\hat e_i}}. \] Consider adding a constant \(c\) to all outcomes: \(Y_i' = Y_i + c\). Then the treated term becomes \[ \frac{\sum_{i=1}^n \frac{A_i (Y_i+c)}{\hat e_i}}{\sum_{i=1}^n \frac{A_i}{\hat e_i}} = \frac{\sum_{i=1}^n \frac{A_i Y_i}{\hat e_i} + c \sum_{i=1}^n \frac{A_i}{\hat e_i}}{\sum_{i=1}^n \frac{A_i}{\hat e_i}} = \frac{\sum_{i=1}^n \frac{A_i Y_i}{\hat e_i}}{\sum_{i=1}^n \frac{A_i}{\hat e_i}} + c. \] Similarly, the control term becomes \[ \frac{\sum_{i=1}^n \frac{(1-A_i) (Y_i+c)}{1-\hat e_i}}{\sum_{i=1}^n \frac{(1-A_i)}{1-\hat e_i}} = \frac{\sum_{i=1}^n \frac{(1-A_i) Y_i}{1-\hat e_i}}{\sum_{i=1}^n \frac{(1-A_i)}{1-\hat e_i}} + c. \] Therefore, \[ \hat{\tau}_{\text{H}}(Y+c) = \Big(\text{treated mean} + c\Big) - \Big(\text{control mean} + c\Big) = \hat{\tau}_{\text{H}}(Y), \] so the Hájek estimator is location invariant. \(\square\)

Stat 546 Homework 4

Assigned: Oct 16, 2025; Due: 11:59 PM CT, Nov 5, 2025

Instruction

Question 1: Estimators of ATE

Question 2: Effect of Unobserved Confounders

Question 3: Matching Methods

Question 4: Invariance of IPW Estimator