Homework Description

This homework contains a mini project for individualized decision making and conditional average treatment effect estimation. We will use a real dataset and apply the methods learned in class to analyze the data and draw conclusions. Some parts of this homework can be open ended. You are encouraged to explore and be creative. Some questions may also be computationally intensive, depending on your choices of methods and parameters. Hence you may decide your approach accordingly, as long as you can answer the questions and justify your choices.

The dataset we will be using comes from the Kaggle Personalize Expedia Hotel Searches competition. Note that our research goal is different from the competition goal, and you should read the description of the data variables carefully before you proceed. Here is what we want to do:

The following code loads the data.

rm(list = ls())
traindata = read.csv("expedia_train_rl.csv")
colnames(traindata)
##  [1] "prop_id"                   "srch_destination_id"      
##  [3] "time_idx"                  "srch_length_of_stay"      
##  [5] "srch_room_count"           "srch_saturday_night_bool" 
##  [7] "prop_location_score1"      "prop_location_score2"     
##  [9] "prop_log_historical_price" "prop_review_score"        
## [11] "prop_starrating"           "random_bool"              
## [13] "position"                  "price_usd"                
## [15] "promotion_flag"            "gross_bookings_usd"       
## [17] "booking_bool"              "click_bool"               
## [19] "comp_rate"                 "comp_inv"
testdata = read.csv("expedia_test_rl.csv")

Remark

Question 1: Validity of Causal Inference (20 points)

We have learned many assumptions related to causal inference and conditional average treatment effects in class. After carefully reading the documentation of this data, do you think this dataset satisfies the assumptions we need? Please discuss each assumption that you think is relevant to our problem (CATE estimation and individualized decision making), and whether you think the assumption is likely to hold. You should provide justification for your answers. Keep in mind that regardless of the validity of the assumptions, we will still proceed to analyze the data in later questions, and moving forward, we will treat each row of this data as an independent observation.

Answer:

Relevant assumptions, their role for CATE / policy estimation, and how to check / diagnose each:

  • SUTVA – no interference. Role: the outcome depends only on the unit's own treatment. Check: argue from the design that units do not interact; look for repeated IDs / clusters.
  • SUTVA – consistency. Role: the observed Y equals the potential Y under the received A. Check: the treatment must be clearly defined (0/1) and outcomes correctly recorded.
  • Unconfoundedness. Role: \(A \perp \{Y(0),Y(1)\} \mid X\). Check: diagnose indirectly via covariate imbalance; check balance before/after weighting; argue about missing covariates.
  • Positivity / overlap. Role: \(0 < P(A=1\mid X=x) < 1\). Check: estimate propensity scores and inspect overlap; check tails and the IPW weight distribution.
  • i.i.d. sampling. Role: observations are independent and identically distributed. Check: look for clustering by IDs and time trends; check outcome or treatment vs time.
  • Correct PS model. Role: needed for IPW / DR validity. Check: AUC and calibration; most important is covariate balance after weighting.
  • Correct outcome model. Role: needed for regression-based CATE and DR. Check: residual diagnostics and CV predictive performance; compare flexible vs linear models.

SUTVA (structure of the sample)

In the real Expedia platform, users compare many hotels within the same search session, so a promotion for one hotel can affect the relative attractiveness and ranking of competing hotels. By construction, however, the HW dataset restricts each destination to a single hotel, largely removing within-search competition and making the no-interference part of SUTVA more plausible for this homework.

(Note: in this data, the (srch_destination_id, prop_id) pair defines a unique hotel; hotels in different geographical locations can share the same property ID in the Expedia data. This is why length(unique(prop_id)) below is smaller than the number of unique destinations.)

## Number of (destination, hotel) pairs and repetitions
with(traindata, {
  cat("Unique destinations:", length(unique(srch_destination_id)), "\n")
  cat("Unique hotels:", length(unique(prop_id)),"\n")
  tab <- table(srch_destination_id, prop_id)
  cat("Max times a (dest,hotel) pair appears:", max(tab), "\n")
})
## Unique destinations: 600 
## Unique hotels: 587 
## Max times a (dest,hotel) pair appears: 600
## consistency sanity checks
stopifnot(all(traindata$promotion_flag %in% 0:1),
          all(traindata$booking_bool   %in% 0:1),
          all(traindata$click_bool     %in% 0:1))
table(traindata$booking_bool,traindata$gross_bookings_usd > 0,useNA = "ifany")
##    
##     FALSE TRUE
##   0  5959    0
##   1     0  378
print(paste("Number of rows (booking = 0 and revenue > 0)=", sum(traindata$booking_bool == 0 & traindata$gross_bookings_usd > 0, na.rm = TRUE)))
## [1] "Number of rows (booking = 0 and revenue > 0)= 0"

In the summary table above, the rows are booking_bool and the columns indicate whether gross_bookings_usd > 0. When booking_bool = 0, all 5959 rows have zero revenue, and when booking_bool = 1, all 378 rows have positive revenue, so the booking indicator and the recorded revenue are consistent.

Unconfoundedness – covariate balance diagnostics

We assume \((Y(1),Y(0)) \perp A \mid X\), with \(X\) the pre-treatment search and hotel covariates. Some confounders (stay length, room count, etc.) are set before the promotion decision, but the promotion might use additional unobserved signals. The Love plot below displays absolute standardized mean differences (SMDs) for each covariate, before any adjustment, between the treated group (promotion_flag = 1) and the control group (promotion_flag = 0). It visualizes baseline imbalance and shows which covariates differ the most across treatment groups. The dashed line is the standard balance threshold of \(0.1\):

library(dplyr)
library(cobalt)  # for bal.tab and love.plot

treat_var <- "promotion_flag"
covs <- c("srch_length_of_stay","srch_room_count","srch_saturday_night_bool",
          "prop_location_score1","prop_location_score2","prop_log_historical_price",
          "prop_starrating","prop_review_score","comp_rate")

form_cov <- as.formula(paste(treat_var, "~", paste(covs, collapse = " + ")))
bal      <- bal.tab(form_cov, data = traindata, un = TRUE)
print(bal)
## Balance Measures
##                              Type Diff.Un
## srch_length_of_stay       Contin.  0.3904
## srch_room_count           Contin.  0.0287
## srch_saturday_night_bool   Binary -0.0936
## prop_location_score1      Contin.  0.8284
## prop_location_score2      Contin.  0.2920
## prop_log_historical_price Contin.  0.1715
## prop_starrating_4          Binary -0.0811
## prop_review_score         Contin. -0.4642
## comp_rate                 Contin.  0.1453
## 
## Sample sizes
##     Control Treated
## All    4568    1769
love.plot(bal,stats="m",abs=TRUE, thresholds=c(m = 0.1),var.order="unadjusted",
          stars= "raw",title= "Standardized mean differences before adjustment",
          line = TRUE)

  • Many covariates have SMDs between 0.3 and 0.8, indicating substantial imbalance.
  • The most imbalanced variables are prop_location_score1, prop_review_score, prop_location_score2, prop_log_historical_price, and srch_length_of_stay.
  • All of these lie well beyond the 0.1 threshold line.

Therefore, before weighting/matching, treated and control units differ substantially in key covariates; naïve comparisons of booking outcomes would be biased.
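
As a follow-up diagnostic, the same balance table can be recomputed after weighting. Below is a minimal sketch (reusing form_cov and covs from above; the inverse-propensity weights come from a simple logistic model, which is only one possible choice, and extreme weights should still be inspected):

# Sketch: covariate balance after inverse-propensity weighting
ps_fit_bal <- glm(form_cov, data = traindata, family = binomial)
ps_bal     <- predict(ps_fit_bal, type = "response")
w_ipw      <- ifelse(traindata$promotion_flag == 1, 1 / ps_bal, 1 / (1 - ps_bal))

bal_w <- bal.tab(form_cov, data = traindata, weights = w_ipw,
                 method = "weighting", un = TRUE)
print(bal_w)
love.plot(bal_w, stats = "m", abs = TRUE, thresholds = c(m = 0.1),
          title = "Standardized mean differences before vs after IPW weighting")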

Positivity (overlap)

Positivity requires a non-zero chance of both receiving and not receiving a promotion.

library(ggplot2)
ps_form  <- as.formula(paste("promotion_flag ~", paste(covs, collapse = " + ")))
ps_glm   <- glm(ps_form, data = traindata, family = binomial)
traindata$ps_hat <- predict(ps_glm, type = "response")
mean(traindata$ps_hat < 0.01)
## [1] 0
mean(traindata$ps_hat > 0.99)
## [1] 0
ggplot(traindata, aes(ps_hat, fill = factor(promotion_flag))) +
  geom_histogram(position = "identity", bins = 30, alpha = 0.5) +
  labs(x = "Propensity score", fill = "promotion_flag")

i.i.d. sampling

The rows are not truly independent and identically distributed because multiple rows correspond to the same hotel–destination pair and the marketplace evolves over time_idx (seasonality, policy changes). This induces within-cluster dependence and non-stationarity. For the homework, we follow the instruction to treat each row as an independent observation.

The i.i.d. assumption is only an approximation in this dataset. Relative to unconfoundedness and SUTVA, minor departures from i.i.d. are a less serious threat here, but in principle cluster-robust SEs or hierarchical models would be preferable.
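
As a quick illustration of that point (a minimal sketch, assuming the sandwich and lmtest packages are available; the simple regression below is only meant to contrast naive and cluster-robust standard errors, not to give a causal estimate), one could cluster on the destination ID:

# Sketch: naive vs cluster-robust standard errors (clustering on srch_destination_id)
library(sandwich)
library(lmtest)

fit_naive <- lm(booking_bool ~ promotion_flag, data = traindata)
coeftest(fit_naive)                                               # assumes i.i.d. errors
coeftest(fit_naive, vcov = vcovCL(fit_naive, cluster = ~ srch_destination_id))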

library(dplyr)
library(patchwork)

cluster_sizes <- traindata %>%
  count(srch_destination_id, prop_id, name = "n_obs") %>%
  arrange(desc(n_obs))

p1 <- ggplot(traindata, aes(time_idx, booking_bool)) +
  geom_smooth(se = FALSE) +
  ggtitle("Booking rate over time_idx")

p2 <- ggplot(traindata, aes(time_idx, promotion_flag)) +
  geom_smooth(se = FALSE) +
  ggtitle("Promotion probability over time_idx")

p1 + p2 

The plots show clear time trends:

  • Booking probability drops sharply from early periods and then slowly rises.
  • Promotion probability increases steadily over time and stabilizes at a higher level.

These patterns imply:

  1. Observations are not identically distributed: early searches seem to differ systematically from later ones (seasonality, marketplace changes, policy shifts).

  2. The promotion policy itself evolves over time, which creates time-driven confounding.

Strong time dynamics violate strict independence and identical distribution, but for the homework we proceed by treating each search row as an independent unit.

Question 2: The X-learner and DR-learner (50 points)

One of the easiest ways to estimate CATE is the X-learner. Please implement the X-learner to estimate the CATE of promotion_flag on gross_bookings_usd.

Answer:

The analysis is restricted to observations with a completed booking, since gross_bookings_usd is positive only in that subset. All continuous covariates were treated as numeric and binary variables as 0/1 indicators; only pre-treatment hotel and search features were included.

For the X-learner, a ridge regression model was used as the linear estimator to control coefficient shrinkage in the presence of correlated predictors, and a tuned random forest was used as a nonparametric estimator to capture nonlinear and interaction effects. Out-of-sample diagnostics showed that the treated-arm linear outcome model produced negative test \(R^2\), indicating that a linear specification is inadequate, while the random forest achieved substantially lower test MSE and higher \(R^2\), suggesting meaningful nonlinearity in the data.

The resulting CATE estimates showed that linear X-learner ATEs were essentially zero, whereas the random forest X-learner yielded an average effect of approximately \(-0.07\) on the log-revenue scale. The DR-learner, which forms a doubly robust pseudo-outcome and is theoretically less sensitive to nuisance model misspecification, produced similar conclusions: near-zero ATE under linear models and a small negative ATE (about \(-0.03\) to \(-0.04\)) under random forest models.

Overall, the better-fitting nonparametric models consistently indicated that promotions do not increase log revenue among bookers and may slightly reduce it, with effects varying across hotel characteristics and time. The poor linear fit in the treated arm further suggests the presence of nonlinear structure that requires flexible learners.

X-Learner

We first restrict to rows with a booking and log-transform revenue: \(Y = \log\{1 + \text{gross_bookings_usd}\}, \qquad A \in \{0,1\}, \qquad X \in \mathbb{R}^p\)

We then assume a working regression model in each arm: \(Y \mid (A=a, X=x) = \mu_a(x) + \varepsilon_a, \qquad \mu_a(x) \approx x^\top \beta_a\)

and estimate \(\hat\mu_1(x),\hat\mu_0(x)\) by CV-Ridge using pre-treatment covariates only.

  • X-learner construction:

The target CATE is \(\tau(x) = \mathbb{E}[Y(1)-Y(0)\mid X=x]\) The X-learner proceeds in three analytic steps:

  1. impute counterfactual outcomes using the outcome models: \(\hat Y^{(0)}_i = \hat\mu_0(X_i),\quad \hat Y^{(1)}_i = \hat\mu_1(X_i)\).

  2. Construct pseudo-effects within each arm: \(D^{(1)}_i = Y_i - \hat\mu_0(X_i)\quad \text{for }A_i=1, \quad D^{(0)}_i = \hat\mu_1(X_i) - Y_i\quad \text{for }A_i=0.\)

We winsorize these to reduce extreme values:

\(\tilde D^{(1)}_i = \min\{\max(D^{(1)}_i, q^{(1)}_{0.01}), q^{(1)}_{0.99}\}, \qquad \tilde D^{(0)}_i = \min\{\max(D^{(0)}_i, q^{(0)}_{0.01}), q^{(0)}_{0.99}\}.\)

  3. Learn CATE functions in each arm by regressing pseudo-effects on \(X\): \(\hat\tau_1(x) = x^\top \hat\gamma_1 \approx \mathbb{E}[\tilde D^{(1)} \mid X=x, A=1],\qquad \hat\tau_0(x) = x^\top \hat\gamma_0 \approx \mathbb{E}[\tilde D^{(0)} \mid X=x, A=0]\)

We also estimate a propensity score \(\hat e(x) = \mathbb{P}(A=1\mid X=x)\)

Finally, the X-learner CATE combines both arms: \(\hat\tau_X(x) = (1-\hat e(x))\hat\tau_1(x) + \hat e(x)\hat\tau_0(x)\) and the (log-scale) ATE is \(\widehat{\text{ATE}} = \frac{1}{n}\sum_{i=1}^n \hat\tau_X(X_i)\).

# ====================== Setup & preprocessing ====================== #
library(dplyr)
library(glmnet)
library(ranger)
library(ggplot2)

set.seed(123)

dat <- traindata %>% mutate(Y_log = log1p(gross_bookings_usd))
# filter(booking_bool == 1) %>%
Xvars <- c(
  "time_idx","srch_length_of_stay","srch_room_count","srch_saturday_night_bool",
  "prop_location_score1","prop_location_score2","prop_log_historical_price",
  "prop_starrating","prop_review_score","comp_rate"
)

A <- dat$promotion_flag
Y <- dat$Y_log
X <- as.matrix(dat[, Xvars])

# ================== Single PS model + trimming ===================== #
ps_form <- as.formula(paste("promotion_flag ~", paste(Xvars, collapse = " + ")))
ps_glm_full <- glm(ps_form, data = dat, family = binomial)
dat$ps_hat_full <- predict(ps_glm_full, type = "response")

keep <- dat$ps_hat_full > 0.01 & dat$ps_hat_full < 0.99
dat  <- dat[keep, ]
A    <- dat$promotion_flag
Y    <- dat$Y_log
X    <- as.matrix(dat[, Xvars])
ps_hat <- dat$ps_hat_full   # PS for linear X-learner

X-learner with linear models

# ================= X-learner: ridge + linear ======================= #

# treated regression μ1
fit1    <- cv.glmnet(X[A == 1, ], Y[A == 1], alpha = 0)   # ridge
mu1_hat <- as.vector(predict(fit1, X, s = "lambda.min"))

# control regression μ0
fit0    <- cv.glmnet(X[A == 0, ], Y[A == 0], alpha = 0)
mu0_hat <- as.vector(predict(fit0, X, s = "lambda.min"))

# pseudo-effects
D1 <- Y[A == 1] - mu0_hat[A == 1]
D0 <- mu1_hat[A == 0] - Y[A == 0]

# winsorization (stabilize pseudo-effects)
q1 <- quantile(D1, c(.01, .99)); D1_w <- pmin(pmax(D1, q1[1]), q1[2])
q0 <- quantile(D0, c(.01, .99)); D0_w <- pmin(pmax(D0, q0[1]), q0[2])

X_treat <- as.data.frame(dat[A == 1, Xvars, drop = FALSE])
X_ctrl  <- as.data.frame(dat[A == 0, Xvars, drop = FALSE])

tau1_mod <- lm(D1_w ~ ., data = cbind(D1_w = D1_w, X_treat))
tau0_mod <- lm(D0_w ~ ., data = cbind(D0_w = D0_w, X_ctrl))

tau1_hat <- predict(tau1_mod, newdata = dat)
tau0_hat <- predict(tau0_mod, newdata = dat)

tau_x <- (1 - ps_hat) * tau1_hat + ps_hat * tau0_hat
ATE_x <- mean(tau_x)
ATE_x
## [1] 0.04131178
p1 = ggplot(data.frame(tau_x), aes(tau_x)) +
  geom_histogram(bins = 30, color = "white", fill = "steelblue") +
  labs(x = "CATE (X-learner, Ridge)", y = "Count")

p2 = ggplot(data.frame(t = dat$time_idx, tau = tau_x),
       aes(t, tau)) +
  geom_point(alpha = .2) +
  geom_smooth(method = "loess") +
  labs(x = "time_idx", y = "CATE")
p1+p2

Diagnostic

par(mfrow = c(1, 2))
plot(mu1_hat[A == 1], Y[A == 1] - mu1_hat[A == 1],
     xlab = "Predicted mu1", ylab = "Residual", main = "Treated residuals")
abline(h = 0, lty = 2)

plot(mu0_hat[A == 0], Y[A == 0] - mu0_hat[A == 0],
     xlab = "Predicted mu0", ylab = "Residual", main = "Control residuals")
abline(h = 0, lty = 2)

par(mfrow = c(1, 1))

X-learner with random forest

# ================== RF hyperparameter tuning (global) ============== #
fit_ranger_tuned <- function(df, y_name, Xvars,
                             min_node_vec = c(5, 10, 20),
                             mtry_vec = NULL,
                             num.trees = 800) {
  if (is.null(mtry_vec)) {
    p <- length(Xvars)
    mtry_vec <- c(max(1L, floor(sqrt(p))), max(1L, floor(p / 3)))
  }
  best_mod <- NULL
  best_err <- Inf
  stats <- data.frame()
  for (mn in min_node_vec) {
    for (mt in mtry_vec) {
      mod <- ranger(
        formula = reformulate(Xvars, response = y_name),
        data    = df[, c(y_name, Xvars)],
        num.trees = num.trees,
        mtry      = mt,
        min.node.size = mn,
        sample.fraction = 0.8,
        importance      = "impurity"
      )
      err <- mod$prediction.error
      stats <- rbind(
        stats,
        data.frame(min.node.size = mn, mtry = mt, oob_mse = err)
      )
      if (err < best_err) {
        best_err <- err
        best_mod <- mod
      }
    }
  }
  list(model = best_mod, stats = stats)
}
df_treated_full <- dat %>%
  dplyr::filter(promotion_flag == 1) %>%
  dplyr::select(Y_log, dplyr::all_of(Xvars))
df_control_full <- dat %>%
  dplyr::filter(promotion_flag == 0) %>%
  dplyr::select(Y_log, dplyr::all_of(Xvars))


tune_treated  <- fit_ranger_tuned(df_treated_full, "Y_log", Xvars)
tune_control  <- fit_ranger_tuned(df_control_full, "Y_log", Xvars)

# Extract best hyperparameters from tuning stats (robust)
idx_treated  <- which.min(tune_treated$stats$oob_mse)
idx_control  <- which.min(tune_control$stats$oob_mse)

best_treated_mn <- as.integer(tune_treated$stats$min.node.size[idx_treated])
best_treated_mt <- as.integer(tune_treated$stats$mtry[idx_treated])

best_control_mn <- as.integer(tune_control$stats$min.node.size[idx_control])
best_control_mt <- as.integer(tune_control$stats$mtry[idx_control])

# ================== K-fold cross-fitted μ1, μ0, and PS ============= #
K <- 5
n <- nrow(dat)
fold_id <- sample(rep(1:K, length.out = n))

mu1_hat_cf <- rep(NA_real_, n)
mu0_hat_cf <- rep(NA_real_, n)
ps_hat_cf  <- rep(NA_real_, n)

for (k in 1:K) {
  idx_train <- which(fold_id != k)
  idx_test  <- which(fold_id == k)

  dat_tr <- dat[idx_train, ]
  dat_te <- dat[idx_test, ]

  X_te <- dat_te[, Xvars, drop = FALSE]

  # cross-fitted PS
  ps_glm_k <- glm(ps_form, data = dat_tr, family = binomial)
  ps_hat_cf[idx_test] <- predict(ps_glm_k, newdata = dat_te, type = "response")

  # RF outcome models on treated / control separately (train only)
  df_tr_treated <- dat_tr %>%
    dplyr::filter(promotion_flag == 1) %>%
    dplyr::select(Y_log, dplyr::all_of(Xvars))
  
  df_tr_control <- dat_tr %>%
    dplyr::filter(promotion_flag == 0) %>%
    dplyr::select(Y_log, dplyr::all_of(Xvars))


  m1_rf_k <- ranger(
    formula = reformulate(Xvars, "Y_log"),
    data    = df_tr_treated,
    num.trees = 800,
    mtry      = best_treated_mt,
    min.node.size = best_treated_mn,
    sample.fraction = 0.8,
    importance      = "impurity"
  )

  m0_rf_k <- ranger(
    formula = reformulate(Xvars, "Y_log"),
    data    = df_tr_control,
    num.trees = 800,
    mtry      = best_control_mt,
    min.node.size = best_control_mn,
    sample.fraction = 0.8,
    importance      = "impurity"
  )

  mu1_hat_cf[idx_test] <- predict(m1_rf_k, data = X_te)$predictions
  mu0_hat_cf[idx_test] <- predict(m0_rf_k, data = X_te)$predictions
}

# ================== Pseudo-effects + stabilization ================= #
D1_all <- rep(NA_real_, n)
D0_all <- rep(NA_real_, n)

D1_all[A == 1] <- Y[A == 1] - mu0_hat_cf[A == 1]
D0_all[A == 0] <- mu1_hat_cf[A == 0] - Y[A == 0]

q1_rf <- quantile(D1_all[A == 1], c(0.01, 0.99), na.rm = TRUE)
q0_rf <- quantile(D0_all[A == 0], c(0.01, 0.99), na.rm = TRUE)

D1_rf_w <- D1_all
D0_rf_w <- D0_all

D1_rf_w[A == 1] <- pmin(pmax(D1_all[A == 1], q1_rf[1]), q1_rf[2])
D0_rf_w[A == 0] <- pmin(pmax(D0_all[A == 0], q0_rf[1]), q0_rf[2])

# ================== RF CATE models (τ1, τ0) ======================== #
X_df <- as.data.frame(dat[, Xvars, drop = FALSE])

df_tau1 <- data.frame(D1_rf_w = D1_rf_w[A == 1],
                      X_df[A == 1, , drop = FALSE])
df_tau0 <- data.frame(D0_rf_w = D0_rf_w[A == 0],
                      X_df[A == 0, , drop = FALSE])

tau1_rf_mod <- ranger(
  D1_rf_w ~ ., data = df_tau1,
  num.trees = 800,
  mtry      = best_treated_mt,
  min.node.size = best_treated_mn,
  sample.fraction = 0.8,
  importance      = "impurity"
)

tau0_rf_mod <- ranger(
  D0_rf_w ~ ., data = df_tau0,
  num.trees = 800,
  mtry      = best_control_mt,
  min.node.size = best_control_mn,
  sample.fraction = 0.8,
  importance      = "impurity"
)

tau1_hat_rf <- predict(tau1_rf_mod, data = X_df)$predictions
tau0_hat_rf <- predict(tau0_rf_mod, data = X_df)$predictions

# ================== Final X-learner CATE + ATE ===================== #
tau_x_rf_cf <- (1 - ps_hat_cf) * tau1_hat_rf + ps_hat_cf * tau0_hat_rf
ATE_x_rf_cf <- mean(tau_x_rf_cf)

cat("ATE (log-scale) from cross-fitted RF X-learner:", ATE_x_rf_cf, "\n")
## ATE (log-scale) from cross-fitted RF X-learner: 0.08097908
# ================== Some diagnostics / plots ======================= #
p1 = ggplot(data.frame(tau_x_rf_cf = tau_x_rf_cf),
       aes(x = tau_x_rf_cf)) +
  geom_histogram(bins = 30, color = "white") +
  labs(x = "Estimated CATE (X-learner, RF, cross-fitted)", y = "Count")

p2 = ggplot(data.frame(time_idx = dat$time_idx, tau = tau_x_rf_cf),
       aes(time_idx, tau)) +
  geom_point(alpha = 0.2) +
  geom_smooth(method = "loess") +
  labs(x = "time_idx", y = "CATE (RF X-learner, cross-fitted)")

vi1 <- sort(tau1_rf_mod$variable.importance, decreasing = TRUE)
df_vi1 <- data.frame(
  var = factor(names(vi1), levels = names(vi1)),
  imp = as.numeric(vi1)
)

p3 = ggplot(df_vi1, aes(x = var, y = imp)) +
  geom_col() +
  coord_flip() +
  labs(x = "Covariate", y = "RF importance (treated CATE model)",
       title = "Variable importance – tau1_rf_mod (treated)")

p1+p2+p3

DR Learner

DR-learner builds a single pseudo-outcome \[\phi_i = \mu_1(X_i)-\mu_0(X_i)+ \frac{A_i\{Y_i-\mu_1(X_i)\}}{e(X_i)}-\frac{(1-A_i)\{Y_i-\mu_0(X_i)\}}{1-e(X_i)}\] and then regresses \(\phi_i\) on \(X_i\).

If outcome models and PS are estimated well (with cross-fitting), this is Neyman-orthogonal and achieves good rates; it’s more robust but can be higher-variance than X-learner in finite samples.

Compared to the X-learner, the DR-learner uses a single doubly-robust pseudo-outcome that combines both outcome and propensity models. With cross-fitted nuisance estimates, it is Neyman-orthogonal and can attain faster convergence rates, but in finite samples it can be more variable and more sensitive to extreme propensity scores, so clipping and winsorization are useful in practice.
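
To see why the pseudo-outcome targets the CATE (a sketch of the standard double-robustness argument): under unconfoundedness and a correct propensity score,
\[
\mathbb{E}\!\left[\frac{A\{Y-\mu_1(X)\}}{e(X)} \,\middle|\, X=x\right] = \mathbb{E}[Y(1)\mid X=x]-\mu_1(x),
\]
and analogously for the control correction term, so \(\mathbb{E}[\phi_i \mid X_i=x]=\mathbb{E}[Y(1)-Y(0)\mid X=x]=\tau(x)\) even if the outcome models \(\mu_1,\mu_0\) are misspecified. Conversely, if \(\mu_1,\mu_0\) are correct, both correction terms have conditional mean zero even when \(e\) is misspecified (as long as it stays away from 0 and 1), which is why the learner is called doubly robust.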

DR-learner with linear models

# ================= DR-learner: linear base models ================== #
# uses: mu1_hat, mu0_hat, ps_hat, A, Y, X, Xvars from previous chunks

# avoid extreme PS in denominators (already trimmed, but clip a bit)
ps_lin_clip <- pmin(pmax(ps_hat, 0.02), 0.98)

# DR pseudo-outcomes
phi_dr_lin <- (mu1_hat - mu0_hat) +
  A * (Y - mu1_hat) / ps_lin_clip -
  (1 - A) * (Y - mu0_hat) / (1 - ps_lin_clip)

# small winsorization of pseudo-outcomes for stability
q_dr_lin <- quantile(phi_dr_lin, c(0.01, 0.99), na.rm = TRUE)
phi_dr_lin_w <- pmin(pmax(phi_dr_lin, q_dr_lin[1]), q_dr_lin[2])

# DR CATE model: ridge regression of phi on X
fit_dr_lin <- cv.glmnet(X, phi_dr_lin_w, alpha = 0)
tau_dr_lin_hat <- as.vector(predict(fit_dr_lin, X, s = "lambda.min"))
ATE_dr_lin <- mean(tau_dr_lin_hat)

cat("DR Learner Linear ATE = ", ATE_dr_lin)
## DR Learner Linear ATE =  -0.01175858
p1 = ggplot(data.frame(tau_dr_lin_hat = tau_dr_lin_hat),
       aes(tau_dr_lin_hat)) +
  geom_histogram(bins = 30, color = "white", fill = "steelblue") +
  labs(x = "CATE (DR-learner, linear)", y = "Count")

p2 = ggplot(data.frame(time_idx = dat$time_idx, tau = tau_dr_lin_hat),
       aes(time_idx, tau)) +
  geom_point(alpha = 0.2) +
  geom_smooth(method = "loess") +
  labs(x = "time_idx", y = "CATE (DR, linear)")
p1+p2

DR-learner with non-linear models

# ============= DR-learner: RF base models (cross-fitted) =========== #
# uses: mu1_hat_cf, mu0_hat_cf, ps_hat_cf, A, Y, dat, Xvars, fit_ranger_tuned

# clip cross-fitted PS
ps_rf_clip <- pmin(pmax(ps_hat_cf, 0.02), 0.98)

# DR pseudo-outcomes (RF version)
phi_dr_rf <- (mu1_hat_cf - mu0_hat_cf) +
  A * (Y - mu1_hat_cf) / ps_rf_clip -
  (1 - A) * (Y - mu0_hat_cf) / (1 - ps_rf_clip)
# winsorize DR pseudo-outcomes
q_dr_rf <- quantile(phi_dr_rf, c(0.01, 0.99), na.rm = TRUE)
phi_dr_rf_w <- pmin(pmax(phi_dr_rf, q_dr_rf[1]), q_dr_rf[2])

X_df <- as.data.frame(dat[, Xvars, drop = FALSE])
df_dr <- data.frame(phi_dr_rf_w = phi_dr_rf_w, X_df)

# (optional) tune RF specifically for DR CATE model
cat("=== Tuning RF for DR-learner CATE model ===\n")
## === Tuning RF for DR-learner CATE model ===
tune_dr <- fit_ranger_tuned(df_dr, "phi_dr_rf_w", Xvars)

# extract best hyperparameters
idx_dr <- which.min(tune_dr$stats$oob_mse)
best_dr_mn <- as.integer(tune_dr$stats$min.node.size[idx_dr])
best_dr_mt <- as.integer(tune_dr$stats$mtry[idx_dr])

# final DR CATE RF model
tau_dr_rf_mod <- ranger(
  phi_dr_rf_w ~ ., data = df_dr,
  num.trees      = 800,
  mtry           = best_dr_mt,
  min.node.size  = best_dr_mn,
  sample.fraction = 0.8,
  importance     = "impurity"
)

tau_dr_rf_hat <- predict(tau_dr_rf_mod, data = X_df)$predictions
ATE_dr_rf <- mean(tau_dr_rf_hat)

cat("DR Learner Random Forest ATE = ",ATE_dr_rf)
## DR Learner Random Forest ATE =  0.04208706
p1= ggplot(data.frame(tau_dr_rf_hat = tau_dr_rf_hat),
       aes(tau_dr_rf_hat)) +
  geom_histogram(bins = 30, color = "white") +
  labs(x = "CATE (DR-learner, RF)", y = "Count")

p2 = ggplot(data.frame(time_idx = dat$time_idx, tau = tau_dr_rf_hat),
       aes(time_idx, tau)) +
  geom_point(alpha = 0.2) +
  geom_smooth(method = "loess") +
  labs(x = "time_idx", y = "CATE (DR, RF)")

# variable importance for DR CATE RF
vi_dr <- sort(tau_dr_rf_mod$variable.importance, decreasing = TRUE)
df_vi_dr <- data.frame(
  var = factor(names(vi_dr), levels = names(vi_dr)),
  imp = as.numeric(vi_dr)
)

p3= ggplot(df_vi_dr, aes(x = var, y = imp)) +
  geom_col() +
  coord_flip() +
  labs(x = "Covariate", y = "RF importance (DR CATE model)",
       title = "Variable importance – DR-learner RF")
p1+p2+p3

Out of Sample Analysis

# ================= Test set preprocessing ========================= #
test_dat <- testdata %>%
  dplyr::filter(booking_bool == 1) %>%
  dplyr::mutate(Y_log = log1p(gross_bookings_usd))

X_test   <- as.matrix(test_dat[, Xvars])
Y_test   <- test_dat$Y_log
A_test   <- test_dat$promotion_flag
# reuse the same PS model as in training
ps_test  <- predict(ps_glm_full, newdata = test_dat, type = "response")

Linear nuisances

# ========== Out-of-sample diagnostics: linear nuisances =========== #

# Outcome models μ1, μ0 on test
mu1_test_lin <- as.vector(predict(fit1, newx = X_test, s = "lambda.min"))
mu0_test_lin <- as.vector(predict(fit0, newx = X_test, s = "lambda.min"))

mse1_lin <- mean((Y_test[A_test == 1] - mu1_test_lin[A_test == 1])^2)
mse0_lin <- mean((Y_test[A_test == 0] - mu0_test_lin[A_test == 0])^2)

r2_lin_1 <- 1 - mse1_lin / var(Y_test[A_test == 1])
r2_lin_0 <- 1 - mse0_lin / var(Y_test[A_test == 0])

cat("Linear μ1  - Test MSE:", round(mse1_lin, 4),
    "  R^2:", round(r2_lin_1, 3), "\n")
## Linear μ1  - Test MSE: 1.419   R^2: -5.275
cat("Linear μ0  - Test MSE:", round(mse0_lin, 4),
    "  R^2:", round(r2_lin_0, 3), "\n")
## Linear μ0  - Test MSE: 1.5333   R^2: -7.368
# Propensity score on test
library(pROC)
roc_lin <- pROC::roc(A_test, ps_test)
auc_lin <- pROC::auc(roc_lin)
cat("Linear PS  - Test AUC:", round(as.numeric(auc_lin), 3), "\n")
## Linear PS  - Test AUC: 0.606
ggplot(data.frame(ps_test = ps_test,
                  A_test = factor(A_test)),
       aes(ps_test, fill = A_test)) +
  geom_histogram(position = "identity", bins = 30, alpha = 0.5) +
  labs(x = "Test propensity score", fill = "promotion_flag",
       title = "PS overlap on test (linear model)")

RF nuisances

# ========== Out-of-sample diagnostics: RF nuisances =============== #

# Outcome models μ1, μ0 on test from tuned RFs
X_test_df <- as.data.frame(test_dat[, Xvars, drop = FALSE])

mu1_test_rf <- predict(tune_treated$model, data = X_test_df)$predictions
mu0_test_rf <- predict(tune_control$model, data = X_test_df)$predictions

mse1_rf <- mean((Y_test[A_test == 1] - mu1_test_rf[A_test == 1])^2)
mse0_rf <- mean((Y_test[A_test == 0] - mu0_test_rf[A_test == 0])^2)

r2_rf_1 <- 1 - mse1_rf / var(Y_test[A_test == 1])
r2_rf_0 <- 1 - mse0_rf / var(Y_test[A_test == 0])

cat("RF μ1     - Test MSE:", round(mse1_rf, 4),
    "  R^2:", round(r2_rf_1, 3), "\n")
## RF μ1     - Test MSE: 1.4367   R^2: -5.353
cat("RF μ0     - Test MSE:", round(mse0_rf, 4),
    "  R^2:", round(r2_rf_0, 3), "\n")
## RF μ0     - Test MSE: 1.4439   R^2: -6.88
# PS: same logistic model used as base for cross-fitted PS
roc_rf <- pROC::roc(A_test, ps_test)
auc_rf <- pROC::auc(roc_rf)
cat("RF PS (logit) - Test AUC:", round(as.numeric(auc_rf), 3), "\n")
## RF PS (logit) - Test AUC: 0.606
par(mfrow = c(1, 2))
plot(mu1_test_rf[A_test == 1], Y_test[A_test == 1] - mu1_test_rf[A_test == 1],
     xlab = "Predicted μ1 (RF, test)", ylab = "Residual",
     main = "Test residuals – treated (RF)")
abline(h = 0, lty = 2)

plot(mu0_test_rf[A_test == 0], Y_test[A_test == 0] - mu0_test_rf[A_test == 0],
     xlab = "Predicted μ0 (RF, test)", ylab = "Residual",
     main = "Test residuals – control (RF)")
abline(h = 0, lty = 2)

par(mfrow = c(1, 1))

Question 3: Individualized Decision Making (50 points)

We have learned several other methods for estimating the CATE. Please implement one of them (other than the X-learner and DR-learner) to estimate the CATE of promotion_flag on booking_bool (a binary outcome). Keep in mind that the goal of this question is to understand which variables are important for the decision of whether or not to provide the promotion. Complete the following:

Answer:

library(dplyr)
library(ggplot2)
library(ranger)
library(tidyr)

set.seed(123)

# Covariates as in Q2
Xvars <- c(
  "time_idx","srch_length_of_stay","srch_room_count","srch_saturday_night_bool",
  "prop_location_score1","prop_location_score2","prop_log_historical_price",
  "prop_starrating","prop_review_score","comp_rate"
)

# Training data for Q3: booking_bool outcome, no revenue/clicks used
train_q3 <- traindata %>%
  dplyr::select(all_of(c(Xvars, "promotion_flag", "booking_bool"))) %>%
  tidyr::drop_na()

A_train <- train_q3$promotion_flag
Y_train <- train_q3$booking_bool
X_train <- as.matrix(train_q3[, Xvars])

# Propensity model + trimming (same spirit as Q2)
ps_form_q3 <- as.formula(paste("promotion_flag ~", paste(Xvars, collapse = " + ")))
ps_glm_q3  <- glm(ps_form_q3, data = train_q3, family = binomial)
train_q3$ps_hat <- predict(ps_glm_q3, type = "response")

keep_q3 <- train_q3$ps_hat > 0.01 & train_q3$ps_hat < 0.99
train_q3 <- train_q3[keep_q3, ]

A_train <- train_q3$promotion_flag
Y_train <- train_q3$booking_bool
X_train <- as.matrix(train_q3[, Xvars])
ps_hat_train <- train_q3$ps_hat
# Helper: tune ranger for a 0/1 outcome (binary booking_bool)
fit_ranger_tuned_bin <- function(df, y_name, Xvars,
                                 min_node_vec = c(5, 10, 20),
                                 mtry_vec = NULL,
                                 num.trees = 800) {
  if (is.null(mtry_vec)) {
    p <- length(Xvars)
    mtry_vec <- c(max(1L, floor(sqrt(p))), max(1L, floor(p / 3)))
  }
  best_mod <- NULL
  best_err <- Inf
  stats <- data.frame()
  for (mn in min_node_vec) {
    for (mt in mtry_vec) {
      mod <- ranger(
        formula = reformulate(Xvars, response = y_name),
        data    = df[, c(y_name, Xvars)],
        num.trees = num.trees,
        mtry      = mt,
        min.node.size = mn,
        sample.fraction = 0.8,
        importance      = "impurity"
      )
      err <- mod$prediction.error  # MSE for 0/1 outcome
      stats <- rbind(
        stats,
        data.frame(min.node.size = mn, mtry = mt, oob_mse = err)
      )
      if (err < best_err) {
        best_err <- err
        best_mod <- mod
      }
    }
  }
  list(model = best_mod, stats = stats)
}

Train T-learner on training data (RF base models)
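
The T-learner fits separate outcome models for the booking probability in each arm and takes their difference: \(\hat\mu_a(x) \approx \mathbb{P}(\text{booking\_bool}=1 \mid A=a, X=x)\) for \(a \in \{0,1\}\), and \(\hat\tau_T(x) = \hat\mu_1(x) - \hat\mu_0(x)\). Here both arm-specific models are tuned random forests.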

# ===================== T-learner: RF base models =================== #
df_treated_q3 <- train_q3 %>%
  dplyr::filter(promotion_flag == 1) %>%
  dplyr::select(all_of(c("booking_bool", Xvars)))

df_control_q3 <- train_q3 %>%
  dplyr::filter(promotion_flag == 0) %>%
  dplyr::select(all_of(c("booking_bool", Xvars)))

tune_treated_q3 <- fit_ranger_tuned_bin(df_treated_q3, "booking_bool", Xvars)
tune_control_q3 <- fit_ranger_tuned_bin(df_control_q3, "booking_bool", Xvars)

# Tuned models for T-learner
m1_T <- tune_treated_q3$model
m0_T <- tune_control_q3$model

X_train_df <- as.data.frame(train_q3[, Xvars, drop = FALSE])

mu1_hat_train <- predict(m1_T, data = X_train_df)$predictions
mu0_hat_train <- predict(m0_T, data = X_train_df)$predictions

tau_T_train <- as.numeric(mu1_hat_train - mu0_hat_train)
ATE_T_train <- mean(tau_T_train)

cat("ATE from RF T-learner (train, booking_bool):", ATE_T_train, "\n")
## ATE from RF T-learner (train, booking_bool): 0.06615651
# CATE distribution and time pattern
p11= ggplot(data.frame(tau_T_train = tau_T_train),
       aes(x = tau_T_train)) +
  geom_histogram(bins = 30, color = "white") +
  labs(x = "Estimated CATE (T-learner, RF, train)", y = "Count")

p22= ggplot(data.frame(time_idx = train_q3$time_idx,
                  tau = tau_T_train),
       aes(time_idx, tau)) +
  geom_point(alpha = 0.2) +
  geom_smooth(method = "loess") +
  labs(x = "time_idx", y = "CATE (T-learner, RF)")

vi_T1 <- sort(m1_T$variable.importance, decreasing = TRUE)
df_vi_T1 <- data.frame(
  var = factor(names(vi_T1), levels = names(vi_T1)),
  imp = as.numeric(vi_T1)
)

p33 = ggplot(df_vi_T1, aes(x = var, y = imp)) +
  geom_col() +
  coord_flip() +
  labs(x = "Covariate", y = "RF importance (A = 1 outcome model)",
       title = "Variable importance – booking_bool | A = 1")

p11+p22+p33

Test set and outcome models for indirect evaluation

# ===================== Test set for Q3 ============================ #
test_q3 <- testdata %>%
  dplyr::select(all_of(c(Xvars, "promotion_flag", "booking_bool"))) %>%
  tidyr::drop_na()

A_test      <- test_q3$promotion_flag
Y_test      <- test_q3$booking_bool
X_test_df   <- as.data.frame(test_q3[, Xvars, drop = FALSE])

# Propensity on test (diagnostic only)
ps_test_q3 <- predict(ps_glm_q3, newdata = test_q3, type = "response")

Fit separate RF outcome models on test to build an “oracle” test-side CATE for comparison:

# Outcome models fitted on test data (for indirect CATE evaluation)

df_treated_test <- test_q3 %>%
  dplyr::filter(promotion_flag == 1) %>%
  dplyr::select(all_of(c("booking_bool", Xvars)))

df_control_test <- test_q3 %>%
  dplyr::filter(promotion_flag == 0) %>%
  dplyr::select(all_of(c("booking_bool", Xvars)))

tune_treated_test <- fit_ranger_tuned_bin(df_treated_test, "booking_bool", Xvars)
tune_control_test <- fit_ranger_tuned_bin(df_control_test, "booking_bool", Xvars)
m1_test <- tune_treated_test$model
m0_test <- tune_control_test$model

mu1_hat_test_oracle <- predict(m1_test, data = X_test_df)$predictions
mu0_hat_test_oracle <- predict(m0_test, data = X_test_df)$predictions
tau_T_test_oracle   <- as.numeric(mu1_hat_test_oracle - mu0_hat_test_oracle)

# CATE on test from train-based T-learner
tau_T_test_from_train <- as.numeric(
  predict(m1_T, data = X_test_df)$predictions - predict(m0_T, data = X_test_df)$predictions
)

# Agreement between train-based and test-based CATE on test covariates
cor_tau <- cor(tau_T_test_from_train, tau_T_test_oracle)
cat("Correlation between train-based and test-based CATE (test X):", cor_tau, "\n")
## Correlation between train-based and test-based CATE (test X): 0.2862655
# Binned calibration plot
breaks <- quantile(tau_T_test_from_train, probs = seq(0, 1, by = 0.1),
                   na.rm = TRUE)
groups <- cut(tau_T_test_from_train, breaks = breaks, include.lowest = TRUE)

calib_df <- data.frame(
  group = groups,
  tau_train = tau_T_test_from_train,
  tau_test  = tau_T_test_oracle
) %>%
  dplyr::group_by(group) %>%
  dplyr::summarise(
    mean_tau_train = mean(tau_train, na.rm = TRUE),
    mean_tau_test  = mean(tau_test, na.rm = TRUE),
    n              = dplyr::n(),
    .groups = "drop"
  )

calib_df
ggplot(calib_df,
       aes(x = mean_tau_train, y = mean_tau_test, size = n)) +
  geom_point(alpha = 0.7) +
  geom_abline(slope = 1, intercept = 0, linetype = 2) +
  labs(x = "Mean CATE (train-based T-learner, bin)",
       y = "Mean CATE (test-based T-learner, bin)",
       title = "Calibration of CATE on test data (indirect)")

Heterogeneity and policy recommendation

Train-side heterogeneity by CATE quartiles:

# Quartiles of train-based CATE
train_q3$tau_T <- tau_T_train
train_q3$cate_group <- cut(
  train_q3$tau_T,
  breaks = quantile(train_q3$tau_T, probs = c(0, 0.25, 0.5, 0.75, 1),
                    na.rm = TRUE),
  include.lowest = TRUE,
  labels = c("Q1 (lowest)", "Q2", "Q3", "Q4 (highest)")
)

# Summary by CATE quartile (training data)
hetero_tab <- train_q3 %>%
  dplyr::group_by(cate_group) %>%
  dplyr::summarise(
    mean_tau   = mean(tau_T, na.rm = TRUE),
    mean_book  = mean(booking_bool),
    mean_prom  = mean(promotion_flag),
    mean_price = mean(prop_log_historical_price),
    mean_star  = mean(prop_starrating),
    mean_loc1  = mean(prop_location_score1),
    mean_loc2  = mean(prop_location_score2),
    n          = dplyr::n(),
    .groups    = "drop"
  )

hetero_tab

Validation of heterogeneity patterns on test data using the test-side “oracle” CATE:

# Map test observations to CATE groups via train-based CATE on test X
test_q3$tau_T_hat <- tau_T_test_from_train
test_q3$cate_group <- cut(
  test_q3$tau_T_hat,
  breaks = quantile(test_q3$tau_T_hat, probs = c(0, 0.25, 0.5, 0.75, 1),
                    na.rm = TRUE),
  include.lowest = TRUE,
  labels = c("Q1 (lowest)", "Q2", "Q3", "Q4 (highest)")
)

hetero_test_tab <- test_q3 %>%
  dplyr::mutate(tau_oracle = tau_T_test_oracle) %>%
  dplyr::group_by(cate_group) %>%
  dplyr::summarise(
    mean_tau_oracle = mean(tau_oracle, na.rm = TRUE),
    mean_book       = mean(booking_bool),
    mean_prom       = mean(promotion_flag),
    n               = dplyr::n(),
    .groups         = "drop"
  )

hetero_test_tab
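
A simple individualized rule follows from these estimates: recommend the promotion whenever the predicted CATE is positive, and assess that rule on the test set with an inverse-propensity-weighted value estimate. The sketch below reuses tau_T_test_from_train, ps_test_q3, A_test, and Y_test from the chunks above; the zero threshold and the 0.02/0.98 clipping are arbitrary choices, and the value estimate is only as trustworthy as the logistic propensity model.

# Sketch: threshold policy from the train-based T-learner CATE,
# evaluated on the test set via inverse-propensity weighting
d_hat <- as.integer(tau_T_test_from_train > 0)           # recommend promotion if predicted CATE > 0

p_rec <- ifelse(d_hat == 1, ps_test_q3, 1 - ps_test_q3)  # P(observed action == recommended action | X)
p_rec <- pmin(pmax(p_rec, 0.02), 0.98)                   # clip to stabilize the weights

follow   <- as.integer(A_test == d_hat)                  # rows where the logged action matches the policy
V_policy <- mean(follow * Y_test / p_rec)                # IPW estimate of booking rate under the policy
V_logged <- mean(Y_test)                                 # booking rate under the logged policy

cat("Share of test searches recommended for promotion:", round(mean(d_hat), 3), "\n")
cat("IPW value of the CATE > 0 policy:", round(V_policy, 4),
    " vs observed booking rate:", round(V_logged, 4), "\n")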