Homework Description

This homework contains a mini project for individualized decision making and conditional average treatment effect estimation. We will use a real dataset and apply the methods learned in class to analyze the data and draw conclusions. Some parts of this homework can be open ended. You are encouraged to explore and be creative. Some questions may also be computationally intensive, depending on your choices of methods and parameters. Hence you may decide your approach accordingly, as long as you can answer the questions and justify your choices.

The dataset we will be using comes from the Kaggle Personalize Expedia Hotel Searches competition. Note that our research goal is different from the competition goal, and you should read the description of the data variables carefully before you proceed. Here is what we want to do:

The following code loads the data.

rm(list = ls())
traindata = read.csv("expedia_train_rl.csv")
colnames(traindata)
##  [1] "prop_id"                   "srch_destination_id"      
##  [3] "time_idx"                  "srch_length_of_stay"      
##  [5] "srch_room_count"           "srch_saturday_night_bool" 
##  [7] "prop_location_score1"      "prop_location_score2"     
##  [9] "prop_log_historical_price" "prop_review_score"        
## [11] "prop_starrating"           "random_bool"              
## [13] "position"                  "price_usd"                
## [15] "promotion_flag"            "gross_bookings_usd"       
## [17] "booking_bool"              "click_bool"               
## [19] "comp_rate"                 "comp_inv"
testdata = read.csv("expedia_test_rl.csv")

Remark

Question 1: Validity of Causal Inference (20 points)

We have learned many assumptions related to causal inference and conditional average treatment effects in class. After carefully reading the documentation of this data, do you think this dataset satisfies the assumptions we need? Please discuss each assumption that you think is relevant to our problem (CATE estimation and individualized decision making), and whether you think the assumption is likely to hold. You should provide justification for your answers. Keep in mind that regardless of the validity of the assumptions, we will still proceed to analyze the data in later questions, and moving forward, we will treat each row of this data as an independent observation.

Answer:

Relevant assumptions, their role for CATE / policy estimation, and how to check / diagnose each:

  • SUTVA – no interference. Role: the outcome depends only on the unit's own treatment. Check: argue from the design that units do not interact; look for repeated IDs / clusters.
  • SUTVA – consistency. Role: the observed Y equals the potential Y under the received A. Check: the treatment must be clearly defined (0/1) and outcomes correctly recorded.
  • Unconfoundedness. Role: \(A \perp \{Y(0),Y(1)\} \mid X\). Check: diagnose indirectly via covariate imbalance; check balance before/after weighting; argue about missing covariates.
  • Positivity / overlap. Role: \(0 < P(A=1\mid X=x) < 1\). Check: estimate propensity scores and inspect overlap; check tails and the IPW weight distribution.
  • i.i.d. sampling. Role: observations are independent and identically distributed. Check: look for clustering by IDs and time trends; check outcome or treatment vs time.
  • Correct PS model. Role: needed for IPW / DR validity. Check: AUC and calibration; most important is covariate balance after weighting.
  • Correct outcome model. Role: needed for regression-based CATE and DR. Check: residual diagnostics and CV predictive performance; compare flexible vs linear models.

SUTVA (structure of the sample)

In the real Expedia platform, users compare many hotels within the same search session, so a promotion for one hotel can affect the relative attractiveness and ranking of competing hotels. By construction, however, the HW dataset restricts each destination to a single hotel, largely removing within-search competition and making the no-interference part of SUTVA more plausible for this homework.

(Note: in this data, the (srch_destination_id, prop_id) pair defines a unique hotel; hotels in different geographical locations can share the same property ID in the Expedia data. This is why length(unique(prop_id)) below is smaller than the number of unique destinations.)

## Number of (destination, hotel) pairs and repetitions
with(traindata, {
  cat("Unique destinations:", length(unique(srch_destination_id)), "\n")
  cat("Unique hotels:", length(unique(prop_id)),"\n")
  tab <- table(srch_destination_id, prop_id)
  cat("Max times a (dest,hotel) pair appears:", max(tab), "\n")
})
## Unique destinations: 600 
## Unique hotels: 587 
## Max times a (dest,hotel) pair appears: 600
## consistency sanity checks
stopifnot(all(traindata$promotion_flag %in% 0:1),
          all(traindata$booking_bool   %in% 0:1),
          all(traindata$click_bool     %in% 0:1))
table(traindata$booking_bool,traindata$gross_bookings_usd > 0,useNA = "ifany")
##    
##     FALSE TRUE
##   0  5959    0
##   1     0  378
print(paste("Number of rows (booking = 0 and revenue > 0)=", sum(traindata$booking_bool == 0 & traindata$gross_bookings_usd > 0, na.rm = TRUE)))
## [1] "Number of rows (booking = 0 and revenue > 0)= 0"

In the summary table above, the rows are booking_bool and the columns indicate whether gross_bookings_usd > 0. When booking_bool = 0, all 5959 rows have zero revenue, and when booking_bool = 1, all 378 rows have positive revenue, so the booking indicator and the recorded revenue are consistent.

Unconfoundedness – covariate balance diagnostics

We assume \((Y(1),Y(0)) \perp A \mid X\), with \(X\) the pre-treatment search and hotel covariates. Some confounders (stay length, room count, etc.) are set before the promotion decision, but the promotion might use additional unobserved signals. The Love plot below displays absolute standardized mean differences (SMDs) for each covariate, before any adjustment, between the treated group (promotion_flag = 1) and the control group (promotion_flag = 0). It visualizes baseline imbalance and shows which covariates differ the most across treatment groups. The dashed line is the standard balance threshold of \(0.1\):

library(dplyr)
library(cobalt)  # for bal.tab and love.plot

treat_var <- "promotion_flag"
covs <- c("srch_length_of_stay","srch_room_count","srch_saturday_night_bool",
          "prop_location_score1","prop_location_score2","prop_log_historical_price",
          "prop_starrating","prop_review_score","comp_rate")

form_cov <- as.formula(paste(treat_var, "~", paste(covs, collapse = " + ")))
bal      <- bal.tab(form_cov, data = traindata, un = TRUE)
print(bal)
## Balance Measures
##                              Type Diff.Un
## srch_length_of_stay       Contin.  0.3904
## srch_room_count           Contin.  0.0287
## srch_saturday_night_bool   Binary -0.0936
## prop_location_score1      Contin.  0.8284
## prop_location_score2      Contin.  0.2920
## prop_log_historical_price Contin.  0.1715
## prop_starrating_4          Binary -0.0811
## prop_review_score         Contin. -0.4642
## comp_rate                 Contin.  0.1453
## 
## Sample sizes
##     Control Treated
## All    4568    1769
love.plot(bal,stats="m",abs=TRUE, thresholds=c(m = 0.1),var.order="unadjusted",
          stars= "raw",title= "Standardized mean differences before adjustment",
          line = TRUE)

  • Many covariates have SMDs between 0.3 and 0.8, indicating substantial imbalance.
  • The most imbalanced variables are prop_location_score1, prop_review_score, prop_location_score2, prop_log_historical_price, and srch_length_of_stay.
  • All of these lie well beyond the 0.1 threshold line.

Therefore, before weighting/matching, treated and control units differ substantially in key covariates; naïve comparisons of booking outcomes would be biased.
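
As a follow-up diagnostic, the same balance table can be recomputed after weighting. Below is a minimal sketch (reusing form_cov and covs from above; the inverse-propensity weights come from a simple logistic model, which is only one possible choice, and extreme weights should still be inspected):

# Sketch: covariate balance after inverse-propensity weighting
ps_fit_bal <- glm(form_cov, data = traindata, family = binomial)
ps_bal     <- predict(ps_fit_bal, type = "response")
w_ipw      <- ifelse(traindata$promotion_flag == 1, 1 / ps_bal, 1 / (1 - ps_bal))

bal_w <- bal.tab(form_cov, data = traindata, weights = w_ipw,
                 method = "weighting", un = TRUE)
print(bal_w)
love.plot(bal_w, stats = "m", abs = TRUE, thresholds = c(m = 0.1),
          title = "Standardized mean differences before vs after IPW weighting")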

Positivity (overlap)

Positivity requires a non-zero chance of both receiving and not receiving a promotion.

library(ggplot2)
ps_form  <- as.formula(paste("promotion_flag ~", paste(covs, collapse = " + ")))
ps_glm   <- glm(ps_form, data = traindata, family = binomial)
traindata$ps_hat <- predict(ps_glm, type = "response")
mean(traindata$ps_hat < 0.01)
## [1] 0
mean(traindata$ps_hat > 0.99)
## [1] 0
ggplot(traindata, aes(ps_hat, fill = factor(promotion_flag))) +
  geom_histogram(position = "identity", bins = 30, alpha = 0.5) +
  labs(x = "Propensity score", fill = "promotion_flag")

i.i.d. sampling

The rows are not truly independent and identically distributed because multiple rows correspond to the same hotel–destination pair and the marketplace evolves over time_idx (seasonality, policy changes). This induces within-cluster dependence and non-stationarity. For the homework, we follow the instruction to treat each row as an independent observation.

The i.i.d. assumption is only an approximation in this dataset. Relative to unconfoundedness and SUTVA, minor departures from i.i.d. are a less serious threat here, but in principle cluster-robust SEs or hierarchical models would be preferable.
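
As a quick illustration of that point (a minimal sketch, assuming the sandwich and lmtest packages are available; the simple regression below is only meant to contrast naive and cluster-robust standard errors, not to give a causal estimate), one could cluster on the destination ID:

# Sketch: naive vs cluster-robust standard errors (clustering on srch_destination_id)
library(sandwich)
library(lmtest)

fit_naive <- lm(booking_bool ~ promotion_flag, data = traindata)
coeftest(fit_naive)                                               # assumes i.i.d. errors
coeftest(fit_naive, vcov = vcovCL(fit_naive, cluster = ~ srch_destination_id))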

library(dplyr)
library(patchwork)

cluster_sizes <- traindata %>%
  count(srch_destination_id, prop_id, name = "n_obs") %>%
  arrange(desc(n_obs))

p1 <- ggplot(traindata, aes(time_idx, booking_bool)) +
  geom_smooth(se = FALSE) +
  ggtitle("Booking rate over time_idx")

p2 <- ggplot(traindata, aes(time_idx, promotion_flag)) +
  geom_smooth(se = FALSE) +
  ggtitle("Promotion probability over time_idx")

p1 + p2 

The plots show clear time trends:

  • Booking probability drops sharply from early periods and then slowly rises.
  • Promotion probability increases steadily over time and stabilizes at a higher level.

These patterns imply:

  1. Observations are not identically distributed: early searches seem to differ systematically from later ones (seasonality, marketplace changes, policy shifts).

  2. The promotion policy itself evolves over time, which creates time-driven confounding.

Strong time dynamics violate strict independence and identical distribution, but for the homework we proceed by treating each search row as an independent unit.

Question 2: The X-learner and DR-learner (50 points)

One of the easiest ways to estimate CATE is the X-learner. Please implement the X-learner to estimate the CATE of promotion_flag on gross_bookings_usd.

Answer:

The analysis is restricted to observations with a completed booking, since gross_bookings_usd is positive only in that subset. All continuous covariates were treated as numeric and binary variables as 0/1 indicators; only pre-treatment hotel and search features were included.

For the X-learner, a ridge regression model was used as the linear estimator to control coefficient shrinkage in the presence of correlated predictors, and a tuned random forest was used as a nonparametric estimator to capture nonlinear and interaction effects. Out-of-sample diagnostics showed that the treated-arm linear outcome model produced negative test \(R^2\), indicating that a linear specification is inadequate, while the random forest achieved substantially lower test MSE and higher \(R^2\), suggesting meaningful nonlinearity in the data.

The resulting CATE estimates showed that linear X-learner ATEs were essentially zero, whereas the random forest X-learner yielded an average effect of approximately \(-0.07\) on the log-revenue scale. The DR-learner, which forms a doubly robust pseudo-outcome and is theoretically less sensitive to nuisance model misspecification, produced similar conclusions: near-zero ATE under linear models and a small negative ATE (about \(-0.03\) to \(-0.04\)) under random forest models.

Overall, the better-fitting nonparametric models consistently indicated that promotions do not increase log revenue among bookers and may slightly reduce it, with effects varying across hotel characteristics and time. The poor linear fit in the treated arm further suggests the presence of nonlinear structure that requires flexible learners.

X-Learner

We first restrict to rows with a booking and log-transform revenue: \(Y = \log\{1 + \text{gross_bookings_usd}\}, \qquad A \in \{0,1\}, \qquad X \in \mathbb{R}^p\)

We then assume a working regression model in each arm: \(Y \mid (A=a, X=x) = \mu_a(x) + \varepsilon_a, \qquad \mu_a(x) \approx x^\top \beta_a\)

and estimate \(\hat\mu_1(x),\hat\mu_0(x)\) by CV-Ridge using pre-treatment covariates only.

  • X-learner construction:

The target CATE is \(\tau(x) = \mathbb{E}[Y(1)-Y(0)\mid X=x]\) The X-learner proceeds in three analytic steps:

  1. impute counterfactual outcomes using the outcome models: \(\hat Y^{(0)}_i = \hat\mu_0(X_i),\quad \hat Y^{(1)}_i = \hat\mu_1(X_i)\).

  2. Construct pseudo-effects within each arm: \(D^{(1)}_i = Y_i - \hat\mu_0(X_i)\quad \text{for }A_i=1, \quad D^{(0)}_i = \hat\mu_1(X_i) - Y_i\quad \text{for }A_i=0.\)

We winsorize these to reduce extreme values:

\(\tilde D^{(1)}_i = \min\{\max(D^{(1)}_i, q^{(1)}_{0.01}), q^{(1)}_{0.99}\}, \qquad \tilde D^{(0)}_i = \min\{\max(D^{(0)}_i, q^{(0)}_{0.01}), q^{(0)}_{0.99}\}.\)

  3. Learn CATE functions in each arm by regressing pseudo-effects on \(X\): \(\hat\tau_1(x) = x^\top \hat\gamma_1 \approx \mathbb{E}[\tilde D^{(1)} \mid X=x, A=1],\qquad \hat\tau_0(x) = x^\top \hat\gamma_0 \approx \mathbb{E}[\tilde D^{(0)} \mid X=x, A=0]\)

We also estimate a propensity score \(\hat e(x) = \mathbb{P}(A=1\mid X=x)\)

Finally, the X-learner CATE combines both arms: \(\hat\tau_X(x) = (1-\hat e(x))\hat\tau_1(x) + \hat e(x)\hat\tau_0(x)\) and the (log-scale) ATE is \(\widehat{\text{ATE}} = \frac{1}{n}\sum_{i=1}^n \hat\tau_X(X_i)\).

# ====================== Setup & preprocessing ====================== #
library(dplyr)
library(glmnet)
library(ranger)
library(ggplot2)

set.seed(123)

dat <- traindata %>% mutate(Y_log = log1p(gross_bookings_usd))
# filter(booking_bool == 1) %>%
Xvars <- c(
  "time_idx","srch_length_of_stay","srch_room_count","srch_saturday_night_bool",
  "prop_location_score1","prop_location_score2","prop_log_historical_price",
  "prop_starrating","prop_review_score","comp_rate"
)

A <- dat$promotion_flag
Y <- dat$Y_log
X <- as.matrix(dat[, Xvars])

# ================== Single PS model + trimming ===================== #
ps_form <- as.formula(paste("promotion_flag ~", paste(Xvars, collapse = " + ")))
ps_glm_full <- glm(ps_form, data = dat, family = binomial)
dat$ps_hat_full <- predict(ps_glm_full, type = "response")

keep <- dat$ps_hat_full > 0.01 & dat$ps_hat_full < 0.99
dat  <- dat[keep, ]
A    <- dat$promotion_flag
Y    <- dat$Y_log
X    <- as.matrix(dat[, Xvars])
ps_hat <- dat$ps_hat_full   # PS for linear X-learner

X-learner with linear models

# ================= X-learner: ridge + linear ======================= #

# treated regression μ1
fit1    <- cv.glmnet(X[A == 1, ], Y[A == 1], alpha = 0)   # ridge
mu1_hat <- as.vector(predict(fit1, X, s = "lambda.min"))

# control regression μ0
fit0    <- cv.glmnet(X[A == 0, ], Y[A == 0], alpha = 0)
mu0_hat <- as.vector(predict(fit0, X, s = "lambda.min"))

# pseudo-effects
D1 <- Y[A == 1] - mu0_hat[A == 1]
D0 <- mu1_hat[A == 0] - Y[A == 0]

# winsorization (stabilize pseudo-effects)
q1 <- quantile(D1, c(.01, .99)); D1_w <- pmin(pmax(D1, q1[1]), q1[2])
q0 <- quantile(D0, c(.01, .99)); D0_w <- pmin(pmax(D0, q0[1]), q0[2])

X_treat <- as.data.frame(dat[A == 1, Xvars, drop = FALSE])
X_ctrl  <- as.data.frame(dat[A == 0, Xvars, drop = FALSE])

tau1_mod <- lm(D1_w ~ ., data = cbind(D1_w = D1_w, X_treat))
tau0_mod <- lm(D0_w ~ ., data = cbind(D0_w = D0_w, X_ctrl))

tau1_hat <- predict(tau1_mod, newdata = dat)
tau0_hat <- predict(tau0_mod, newdata = dat)

tau_x <- (1 - ps_hat) * tau1_hat + ps_hat * tau0_hat
ATE_x <- mean(tau_x)
ATE_x
## [1] 0.04131178
p1 = ggplot(data.frame(tau_x), aes(tau_x)) +
  geom_histogram(bins = 30, color = "white", fill = "steelblue") +
  labs(x = "CATE (X-learner, Ridge)", y = "Count")

p2 = ggplot(data.frame(t = dat$time_idx, tau = tau_x),
       aes(t, tau)) +
  geom_point(alpha = .2) +
  geom_smooth(method = "loess") +
  labs(x = "time_idx", y = "CATE")
p1+p2

Diagnostic

par(mfrow = c(1, 2))
plot(mu1_hat[A == 1], Y[A == 1] - mu1_hat[A == 1],
     xlab = "Predicted mu1", ylab = "Residual", main = "Treated residuals")
abline(h = 0, lty = 2)

plot(mu0_hat[A == 0], Y[A == 0] - mu0_hat[A == 0],
     xlab = "Predicted mu0", ylab = "Residual", main = "Control residuals")
abline(h = 0, lty = 2)

par(mfrow = c(1, 1))

X-learner with random forest

# ================== RF hyperparameter tuning (global) ============== #
fit_ranger_tuned <- function(df, y_name, Xvars,
                             min_node_vec = c(5, 10, 20),
                             mtry_vec = NULL,
                             num.trees = 800) {
  if (is.null(mtry_vec)) {
    p <- length(Xvars)
    mtry_vec <- c(max(1L, floor(sqrt(p))), max(1L, floor(p / 3)))
  }
  best_mod <- NULL
  best_err <- Inf
  stats <- data.frame()
  for (mn in min_node_vec) {
    for (mt in mtry_vec) {
      mod <- ranger(
        formula = reformulate(Xvars, response = y_name),
        data    = df[, c(y_name, Xvars)],
        num.trees = num.trees,
        mtry      = mt,
        min.node.size = mn,
        sample.fraction = 0.8,
        importance      = "impurity"
      )
      err <- mod$prediction.error
      stats <- rbind(
        stats,
        data.frame(min.node.size = mn, mtry = mt, oob_mse = err)
      )
      if (err < best_err) {
        best_err <- err
        best_mod <- mod
      }
    }
  }
  list(model = best_mod, stats = stats)
}
df_treated_full <- dat %>%
  dplyr::filter(promotion_flag == 1) %>%
  dplyr::select(Y_log, dplyr::all_of(Xvars))
df_control_full <- dat %>%
  dplyr::filter(promotion_flag == 0) %>%
  dplyr::select(Y_log, dplyr::all_of(Xvars))


tune_treated  <- fit_ranger_tuned(df_treated_full, "Y_log", Xvars)
tune_control  <- fit_ranger_tuned(df_control_full, "Y_log", Xvars)

# Extract best hyperparameters from tuning stats (robust)
idx_treated  <- which.min(tune_treated$stats$oob_mse)
idx_control  <- which.min(tune_control$stats$oob_mse)

best_treated_mn <- as.integer(tune_treated$stats$min.node.size[idx_treated])
best_treated_mt <- as.integer(tune_treated$stats$mtry[idx_treated])

best_control_mn <- as.integer(tune_control$stats$min.node.size[idx_control])
best_control_mt <- as.integer(tune_control$stats$mtry[idx_control])

# ================== K-fold cross-fitted μ1, μ0, and PS ============= #
K <- 5
n <- nrow(dat)
fold_id <- sample(rep(1:K, length.out = n))

mu1_hat_cf <- rep(NA_real_, n)
mu0_hat_cf <- rep(NA_real_, n)
ps_hat_cf  <- rep(NA_real_, n)

for (k in 1:K) {
  idx_train <- which(fold_id != k)
  idx_test  <- which(fold_id == k)

  dat_tr <- dat[idx_train, ]
  dat_te <- dat[idx_test, ]

  X_te <- dat_te[, Xvars, drop = FALSE]

  # cross-fitted PS
  ps_glm_k <- glm(ps_form, data = dat_tr, family = binomial)
  ps_hat_cf[idx_test] <- predict(ps_glm_k, newdata = dat_te, type = "response")

  # RF outcome models on treated / control separately (train only)
  df_tr_treated <- dat_tr %>%
    dplyr::filter(promotion_flag == 1) %>%
    dplyr::select(Y_log, dplyr::all_of(Xvars))
  
  df_tr_control <- dat_tr %>%
    dplyr::filter(promotion_flag == 0) %>%
    dplyr::select(Y_log, dplyr::all_of(Xvars))


  m1_rf_k <- ranger(
    formula = reformulate(Xvars, "Y_log"),
    data    = df_tr_treated,
    num.trees = 800,
    mtry      = best_treated_mt,
    min.node.size = best_treated_mn,
    sample.fraction = 0.8,
    importance      = "impurity"
  )

  m0_rf_k <- ranger(
    formula = reformulate(Xvars, "Y_log"),
    data    = df_tr_control,
    num.trees = 800,
    mtry      = best_control_mt,
    min.node.size = best_control_mn,
    sample.fraction = 0.8,
    importance      = "impurity"
  )

  mu1_hat_cf[idx_test] <- predict(m1_rf_k, data = X_te)$predictions
  mu0_hat_cf[idx_test] <- predict(m0_rf_k, data = X_te)$predictions
}

# ================== Pseudo-effects + stabilization ================= #
D1_all <- rep(NA_real_, n)
D0_all <- rep(NA_real_, n)

D1_all[A == 1] <- Y[A == 1] - mu0_hat_cf[A == 1]
D0_all[A == 0] <- mu1_hat_cf[A == 0] - Y[A == 0]

q1_rf <- quantile(D1_all[A == 1], c(0.01, 0.99), na.rm = TRUE)
q0_rf <- quantile(D0_all[A == 0], c(0.01, 0.99), na.rm = TRUE)

D1_rf_w <- D1_all
D0_rf_w <- D0_all

D1_rf_w[A == 1] <- pmin(pmax(D1_all[A == 1], q1_rf[1]), q1_rf[2])
D0_rf_w[A == 0] <- pmin(pmax(D0_all[A == 0], q0_rf[1]), q0_rf[2])

# ================== RF CATE models (τ1, τ0) ======================== #
X_df <- as.data.frame(dat[, Xvars, drop = FALSE])

df_tau1 <- data.frame(D1_rf_w = D1_rf_w[A == 1],
                      X_df[A == 1, , drop = FALSE])
df_tau0 <- data.frame(D0_rf_w = D0_rf_w[A == 0],
                      X_df[A == 0, , drop = FALSE])

tau1_rf_mod <- ranger(
  D1_rf_w ~ ., data = df_tau1,
  num.trees = 800,
  mtry      = best_treated_mt,
  min.node.size = best_treated_mn,
  sample.fraction = 0.8,
  importance      = "impurity"
)

tau0_rf_mod <- ranger(
  D0_rf_w ~ ., data = df_tau0,
  num.trees = 800,
  mtry      = best_control_mt,
  min.node.size = best_control_mn,
  sample.fraction = 0.8,
  importance      = "impurity"
)

tau1_hat_rf <- predict(tau1_rf_mod, data = X_df)$predictions
tau0_hat_rf <- predict(tau0_rf_mod, data = X_df)$predictions

# ================== Final X-learner CATE + ATE ===================== #
tau_x_rf_cf <- (1 - ps_hat_cf) * tau1_hat_rf + ps_hat_cf * tau0_hat_rf
ATE_x_rf_cf <- mean(tau_x_rf_cf)

cat("ATE (log-scale) from cross-fitted RF X-learner:", ATE_x_rf_cf, "\n")
## ATE (log-scale) from cross-fitted RF X-learner: 0.08097908
# ================== Some diagnostics / plots ======================= #
p1 = ggplot(data.frame(tau_x_rf_cf = tau_x_rf_cf),
       aes(x = tau_x_rf_cf)) +
  geom_histogram(bins = 30, color = "white") +
  labs(x = "Estimated CATE (X-learner, RF, cross-fitted)", y = "Count")

p2 = ggplot(data.frame(time_idx = dat$time_idx, tau = tau_x_rf_cf),
       aes(time_idx, tau)) +
  geom_point(alpha = 0.2) +
  geom_smooth(method = "loess") +
  labs(x = "time_idx", y = "CATE (RF X-learner, cross-fitted)")

vi1 <- sort(tau1_rf_mod$variable.importance, decreasing = TRUE)
df_vi1 <- data.frame(
  var = factor(names(vi1), levels = names(vi1)),
  imp = as.numeric(vi1)
)

p3 = ggplot(df_vi1, aes(x = var, y = imp)) +
  geom_col() +
  coord_flip() +
  labs(x = "Covariate", y = "RF importance (treated CATE model)",
       title = "Variable importance – tau1_rf_mod (treated)")

p1+p2+p3

DR Learner

DR-learner builds a single pseudo-outcome \[\phi_i = \mu_1(X_i)-\mu_0(X_i)+ \frac{A_i\{Y_i-\mu_1(X_i)\}}{e(X_i)}-\frac{(1-A_i)\{Y_i-\mu_0(X_i)\}}{1-e(X_i)}\] and then regresses \(\phi_i\) on \(X_i\).

If outcome models and PS are estimated well (with cross-fitting), this is Neyman-orthogonal and achieves good rates; it’s more robust but can be higher-variance than X-learner in finite samples.

Compared to the X-learner, the DR-learner uses a single doubly-robust pseudo-outcome that combines both outcome and propensity models. With cross-fitted nuisance estimates, it is Neyman-orthogonal and can attain faster convergence rates, but in finite samples it can be more variable and more sensitive to extreme propensity scores, so clipping and winsorization are useful in practice.
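
To see why the pseudo-outcome targets the CATE (a sketch of the standard double-robustness argument): under unconfoundedness and a correct propensity score,
\[
\mathbb{E}\!\left[\frac{A\{Y-\mu_1(X)\}}{e(X)} \,\middle|\, X=x\right] = \mathbb{E}[Y(1)\mid X=x]-\mu_1(x),
\]
and analogously for the control correction term, so \(\mathbb{E}[\phi_i \mid X_i=x]=\mathbb{E}[Y(1)-Y(0)\mid X=x]=\tau(x)\) even if the outcome models \(\mu_1,\mu_0\) are misspecified. Conversely, if \(\mu_1,\mu_0\) are correct, both correction terms have conditional mean zero even when \(e\) is misspecified (as long as it stays away from 0 and 1), which is why the learner is called doubly robust.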

DR-learner with linear models

# ================= DR-learner: linear base models ================== #
# uses: mu1_hat, mu0_hat, ps_hat, A, Y, X, Xvars from previous chunks

# avoid extreme PS in denominators (already trimmed, but clip a bit)
ps_lin_clip <- pmin(pmax(ps_hat, 0.02), 0.98)

# DR pseudo-outcomes
phi_dr_lin <- (mu1_hat - mu0_hat) +
  A * (Y - mu1_hat) / ps_lin_clip -
  (1 - A) * (Y - mu0_hat) / (1 - ps_lin_clip)

# small winsorization of pseudo-outcomes for stability
q_dr_lin <- quantile(phi_dr_lin, c(0.01, 0.99), na.rm = TRUE)
phi_dr_lin_w <- pmin(pmax(phi_dr_lin, q_dr_lin[1]), q_dr_lin[2])

# DR CATE model: ridge regression of phi on X
fit_dr_lin <- cv.glmnet(X, phi_dr_lin_w, alpha = 0)
tau_dr_lin_hat <- as.vector(predict(fit_dr_lin, X, s = "lambda.min"))
ATE_dr_lin <- mean(tau_dr_lin_hat)

cat("DR Learner Linear ATE = ", ATE_dr_lin)
## DR Learner Linear ATE =  -0.01175858
p1 = ggplot(data.frame(tau_dr_lin_hat = tau_dr_lin_hat),
       aes(tau_dr_lin_hat)) +
  geom_histogram(bins = 30, color = "white", fill = "steelblue") +
  labs(x = "CATE (DR-learner, linear)", y = "Count")

p2 = ggplot(data.frame(time_idx = dat$time_idx, tau = tau_dr_lin_hat),
       aes(time_idx, tau)) +
  geom_point(alpha = 0.2) +
  geom_smooth(method = "loess") +
  labs(x = "time_idx", y = "CATE (DR, linear)")
p1+p2

DR-learner with non-linear models

# ============= DR-learner: RF base models (cross-fitted) =========== #
# uses: mu1_hat_cf, mu0_hat_cf, ps_hat_cf, A, Y, dat, Xvars, fit_ranger_tuned

# clip cross-fitted PS
ps_rf_clip <- pmin(pmax(ps_hat_cf, 0.02), 0.98)

# DR pseudo-outcomes (RF version)
phi_dr_rf <- (mu1_hat_cf - mu0_hat_cf) +
  A * (Y - mu1_hat_cf) / ps_rf_clip -
  (1 - A) * (Y - mu0_hat_cf) / (1 - ps_rf_clip)
# winsorize DR pseudo-outcomes
q_dr_rf <- quantile(phi_dr_rf, c(0.01, 0.99), na.rm = TRUE)
phi_dr_rf_w <- pmin(pmax(phi_dr_rf, q_dr_rf[1]), q_dr_rf[2])

X_df <- as.data.frame(dat[, Xvars, drop = FALSE])
df_dr <- data.frame(phi_dr_rf_w = phi_dr_rf_w, X_df)

# (optional) tune RF specifically for DR CATE model
cat("=== Tuning RF for DR-learner CATE model ===\n")
## === Tuning RF for DR-learner CATE model ===
tune_dr <- fit_ranger_tuned(df_dr, "phi_dr_rf_w", Xvars)

# extract best hyperparameters
idx_dr <- which.min(tune_dr$stats$oob_mse)
best_dr_mn <- as.integer(tune_dr$stats$min.node.size[idx_dr])
best_dr_mt <- as.integer(tune_dr$stats$mtry[idx_dr])

# final DR CATE RF model
tau_dr_rf_mod <- ranger(
  phi_dr_rf_w ~ ., data = df_dr,
  num.trees      = 800,
  mtry           = best_dr_mt,
  min.node.size  = best_dr_mn,
  sample.fraction = 0.8,
  importance     = "impurity"
)

tau_dr_rf_hat <- predict(tau_dr_rf_mod, data = X_df)$predictions
ATE_dr_rf <- mean(tau_dr_rf_hat)

cat("DR Learner Random Forest ATE = ",ATE_dr_rf)
## DR Learner Random Forest ATE =  0.04208706
p1= ggplot(data.frame(tau_dr_rf_hat = tau_dr_rf_hat),
       aes(tau_dr_rf_hat)) +
  geom_histogram(bins = 30, color = "white") +
  labs(x = "CATE (DR-learner, RF)", y = "Count")

p2 = ggplot(data.frame(time_idx = dat$time_idx, tau = tau_dr_rf_hat),
       aes(time_idx, tau)) +
  geom_point(alpha = 0.2) +
  geom_smooth(method = "loess") +
  labs(x = "time_idx", y = "CATE (DR, RF)")

# variable importance for DR CATE RF
vi_dr <- sort(tau_dr_rf_mod$variable.importance, decreasing = TRUE)
df_vi_dr <- data.frame(
  var = factor(names(vi_dr), levels = names(vi_dr)),
  imp = as.numeric(vi_dr)
)

p3= ggplot(df_vi_dr, aes(x = var, y = imp)) +
  geom_col() +
  coord_flip() +
  labs(x = "Covariate", y = "RF importance (DR CATE model)",
       title = "Variable importance – DR-learner RF")
p1+p2+p3

Out of Sample Analysis

# ================= Test set preprocessing ========================= #
test_dat <- testdata %>%
  dplyr::filter(booking_bool == 1) %>%
  dplyr::mutate(Y_log = log1p(gross_bookings_usd))

X_test   <- as.matrix(test_dat[, Xvars])
Y_test   <- test_dat$Y_log
A_test   <- test_dat$promotion_flag
# reuse the same PS model as in training
ps_test  <- predict(ps_glm_full, newdata = test_dat, type = "response")

Linear nuisances

# ========== Out-of-sample diagnostics: linear nuisances =========== #

# Outcome models μ1, μ0 on test
mu1_test_lin <- as.vector(predict(fit1, newx = X_test, s = "lambda.min"))
mu0_test_lin <- as.vector(predict(fit0, newx = X_test, s = "lambda.min"))

mse1_lin <- mean((Y_test[A_test == 1] - mu1_test_lin[A_test == 1])^2)
mse0_lin <- mean((Y_test[A_test == 0] - mu0_test_lin[A_test == 0])^2)

r2_lin_1 <- 1 - mse1_lin / var(Y_test[A_test == 1])
r2_lin_0 <- 1 - mse0_lin / var(Y_test[A_test == 0])

cat("Linear μ1  - Test MSE:", round(mse1_lin, 4),
    "  R^2:", round(r2_lin_1, 3), "\n")
## Linear μ1  - Test MSE: 1.419   R^2: -5.275
cat("Linear μ0  - Test MSE:", round(mse0_lin, 4),
    "  R^2:", round(r2_lin_0, 3), "\n")
## Linear μ0  - Test MSE: 1.5333   R^2: -7.368
# Propensity score on test
library(pROC)
roc_lin <- pROC::roc(A_test, ps_test)
auc_lin <- pROC::auc(roc_lin)
cat("Linear PS  - Test AUC:", round(as.numeric(auc_lin), 3), "\n")
## Linear PS  - Test AUC: 0.606
ggplot(data.frame(ps_test = ps_test,
                  A_test = factor(A_test)),
       aes(ps_test, fill = A_test)) +
  geom_histogram(position = "identity", bins = 30, alpha = 0.5) +
  labs(x = "Test propensity score", fill = "promotion_flag",
       title = "PS overlap on test (linear model)")

RF nuisances

# ========== Out-of-sample diagnostics: RF nuisances =============== #

# Outcome models μ1, μ0 on test from tuned RFs
X_test_df <- as.data.frame(test_dat[, Xvars, drop = FALSE])

mu1_test_rf <- predict(tune_treated$model, data = X_test_df)$predictions
mu0_test_rf <- predict(tune_control$model, data = X_test_df)$predictions

mse1_rf <- mean((Y_test[A_test == 1] - mu1_test_rf[A_test == 1])^2)
mse0_rf <- mean((Y_test[A_test == 0] - mu0_test_rf[A_test == 0])^2)

r2_rf_1 <- 1 - mse1_rf / var(Y_test[A_test == 1])
r2_rf_0 <- 1 - mse0_rf / var(Y_test[A_test == 0])

cat("RF μ1     - Test MSE:", round(mse1_rf, 4),
    "  R^2:", round(r2_rf_1, 3), "\n")
## RF μ1     - Test MSE: 1.4367   R^2: -5.353
cat("RF μ0     - Test MSE:", round(mse0_rf, 4),
    "  R^2:", round(r2_rf_0, 3), "\n")
## RF μ0     - Test MSE: 1.4439   R^2: -6.88
# PS: same logistic model used as base for cross-fitted PS
roc_rf <- pROC::roc(A_test, ps_test)
auc_rf <- pROC::auc(roc_rf)
cat("RF PS (logit) - Test AUC:", round(as.numeric(auc_rf), 3), "\n")
## RF PS (logit) - Test AUC: 0.606
par(mfrow = c(1, 2))
plot(mu1_test_rf[A_test == 1], Y_test[A_test == 1] - mu1_test_rf[A_test == 1],
     xlab = "Predicted μ1 (RF, test)", ylab = "Residual",
     main = "Test residuals – treated (RF)")
abline(h = 0, lty = 2)

plot(mu0_test_rf[A_test == 0], Y_test[A_test == 0] - mu0_test_rf[A_test == 0],
     xlab = "Predicted μ0 (RF, test)", ylab = "Residual",
     main = "Test residuals – control (RF)")
abline(h = 0, lty = 2)

par(mfrow = c(1, 1))

Question 3: Individualized Decision Making (50 points)

We have learned several other methods for estimating the CATE. Please implement one of them (other than the X-learner and DR-learner) to estimate the CATE of promotion_flag on booking_bool (a binary outcome). Keep in mind that the goal of this question is to understand which variables are important for the decision of whether or not to provide the promotion. Complete the following:

Answer:

library(dplyr)
library(ggplot2)
library(ranger)
library(tidyr)

set.seed(123)

# Covariates as in Q2
Xvars <- c(
  "time_idx","srch_length_of_stay","srch_room_count","srch_saturday_night_bool",
  "prop_location_score1","prop_location_score2","prop_log_historical_price",
  "prop_starrating","prop_review_score","comp_rate"
)

# Training data for Q3: booking_bool outcome, no revenue/clicks used
train_q3 <- traindata %>%
  dplyr::select(all_of(c(Xvars, "promotion_flag", "booking_bool"))) %>%
  tidyr::drop_na()

A_train <- train_q3$promotion_flag
Y_train <- train_q3$booking_bool
X_train <- as.matrix(train_q3[, Xvars])

# Propensity model + trimming (same spirit as Q2)
ps_form_q3 <- as.formula(paste("promotion_flag ~", paste(Xvars, collapse = " + ")))
ps_glm_q3  <- glm(ps_form_q3, data = train_q3, family = binomial)
train_q3$ps_hat <- predict(ps_glm_q3, type = "response")

keep_q3 <- train_q3$ps_hat > 0.01 & train_q3$ps_hat < 0.99
train_q3 <- train_q3[keep_q3, ]

A_train <- train_q3$promotion_flag
Y_train <- train_q3$booking_bool
X_train <- as.matrix(train_q3[, Xvars])
ps_hat_train <- train_q3$ps_hat
# Helper: tune ranger for a 0/1 outcome (binary booking_bool)
fit_ranger_tuned_bin <- function(df, y_name, Xvars,
                                 min_node_vec = c(5, 10, 20),
                                 mtry_vec = NULL,
                                 num.trees = 800) {
  if (is.null(mtry_vec)) {
    p <- length(Xvars)
    mtry_vec <- c(max(1L, floor(sqrt(p))), max(1L, floor(p / 3)))
  }
  best_mod <- NULL
  best_err <- Inf
  stats <- data.frame()
  for (mn in min_node_vec) {
    for (mt in mtry_vec) {
      mod <- ranger(
        formula = reformulate(Xvars, response = y_name),
        data    = df[, c(y_name, Xvars)],
        num.trees = num.trees,
        mtry      = mt,
        min.node.size = mn,
        sample.fraction = 0.8,
        importance      = "impurity"
      )
      err <- mod$prediction.error  # MSE for 0/1 outcome
      stats <- rbind(
        stats,
        data.frame(min.node.size = mn, mtry = mt, oob_mse = err)
      )
      if (err < best_err) {
        best_err <- err
        best_mod <- mod
      }
    }
  }
  list(model = best_mod, stats = stats)
}

Train T-learner on training data (RF base models)
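
The T-learner fits separate outcome models for the booking probability in each arm and takes their difference: \(\hat\mu_a(x) \approx \mathbb{P}(\text{booking\_bool}=1 \mid A=a, X=x)\) for \(a \in \{0,1\}\), and \(\hat\tau_T(x) = \hat\mu_1(x) - \hat\mu_0(x)\). Here both arm-specific models are tuned random forests.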

# ===================== T-learner: RF base models =================== #
df_treated_q3 <- train_q3 %>%
  dplyr::filter(promotion_flag == 1) %>%
  dplyr::select(all_of(c("booking_bool", Xvars)))

df_control_q3 <- train_q3 %>%
  dplyr::filter(promotion_flag == 0) %>%
  dplyr::select(all_of(c("booking_bool", Xvars)))

tune_treated_q3 <- fit_ranger_tuned_bin(df_treated_q3, "booking_bool", Xvars)
tune_control_q3 <- fit_ranger_tuned_bin(df_control_q3, "booking_bool", Xvars)

# Tuned models for T-learner
m1_T <- tune_treated_q3$model
m0_T <- tune_control_q3$model

X_train_df <- as.data.frame(train_q3[, Xvars, drop = FALSE])

mu1_hat_train <- predict(m1_T, data = X_train_df)$predictions
mu0_hat_train <- predict(m0_T, data = X_train_df)$predictions

tau_T_train <- as.numeric(mu1_hat_train - mu0_hat_train)
ATE_T_train <- mean(tau_T_train)

cat("ATE from RF T-learner (train, booking_bool):", ATE_T_train, "\n")
## ATE from RF T-learner (train, booking_bool): 0.06615651
# CATE distribution and time pattern
p11= ggplot(data.frame(tau_T_train = tau_T_train),
       aes(x = tau_T_train)) +
  geom_histogram(bins = 30, color = "white") +
  labs(x = "Estimated CATE (T-learner, RF, train)", y = "Count")

p22= ggplot(data.frame(time_idx = train_q3$time_idx,
                  tau = tau_T_train),
       aes(time_idx, tau)) +
  geom_point(alpha = 0.2) +
  geom_smooth(method = "loess") +
  labs(x = "time_idx", y = "CATE (T-learner, RF)")

vi_T1 <- sort(m1_T$variable.importance, decreasing = TRUE)
df_vi_T1 <- data.frame(
  var = factor(names(vi_T1), levels = names(vi_T1)),
  imp = as.numeric(vi_T1)
)

p33 = ggplot(df_vi_T1, aes(x = var, y = imp)) +
  geom_col() +
  coord_flip() +
  labs(x = "Covariate", y = "RF importance (A = 1 outcome model)",
       title = "Variable importance – booking_bool | A = 1")

p11+p22+p33

Test set and outcome models for indirect evaluation

# ===================== Test set for Q3 ============================ #
test_q3 <- testdata %>%
  dplyr::select(all_of(c(Xvars, "promotion_flag", "booking_bool"))) %>%
  tidyr::drop_na()

A_test      <- test_q3$promotion_flag
Y_test      <- test_q3$booking_bool
X_test_df   <- as.data.frame(test_q3[, Xvars, drop = FALSE])

# Propensity on test (diagnostic only)
ps_test_q3 <- predict(ps_glm_q3, newdata = test_q3, type = "response")

Fit separate RF outcome models on test to build an “oracle” test-side CATE for comparison:

# Outcome models fitted on test data (for indirect CATE evaluation)

df_treated_test <- test_q3 %>%
  dplyr::filter(promotion_flag == 1) %>%
  dplyr::select(all_of(c("booking_bool", Xvars)))

df_control_test <- test_q3 %>%
  dplyr::filter(promotion_flag == 0) %>%
  dplyr::select(all_of(c("booking_bool", Xvars)))

tune_treated_test <- fit_ranger_tuned_bin(df_treated_test, "booking_bool", Xvars)
tune_control_test <- fit_ranger_tuned_bin(df_control_test, "booking_bool", Xvars)
m1_test <- tune_treated_test$model
m0_test <- tune_control_test$model

mu1_hat_test_oracle <- predict(m1_test, data = X_test_df)$predictions
mu0_hat_test_oracle <- predict(m0_test, data = X_test_df)$predictions
tau_T_test_oracle   <- as.numeric(mu1_hat_test_oracle - mu0_hat_test_oracle)

# CATE on test from train-based T-learner
tau_T_test_from_train <- as.numeric(
  predict(m1_T, data = X_test_df)$predictions - predict(m0_T, data = X_test_df)$predictions
)

# Agreement between train-based and test-based CATE on test covariates
cor_tau <- cor(tau_T_test_from_train, tau_T_test_oracle)
cat("Correlation between train-based and test-based CATE (test X):", cor_tau, "\n")
## Correlation between train-based and test-based CATE (test X): 0.2862655
# Binned calibration plot
breaks <- quantile(tau_T_test_from_train, probs = seq(0, 1, by = 0.1),
                   na.rm = TRUE)
groups <- cut(tau_T_test_from_train, breaks = breaks, include.lowest = TRUE)

calib_df <- data.frame(
  group = groups,
  tau_train = tau_T_test_from_train,
  tau_test  = tau_T_test_oracle
) %>%
  dplyr::group_by(group) %>%
  dplyr::summarise(
    mean_tau_train = mean(tau_train, na.rm = TRUE),
    mean_tau_test  = mean(tau_test, na.rm = TRUE),
    n              = dplyr::n(),
    .groups = "drop"
  )

calib_df
ggplot(calib_df,
       aes(x = mean_tau_train, y = mean_tau_test, size = n)) +
  geom_point(alpha = 0.7) +
  geom_abline(slope = 1, intercept = 0, linetype = 2) +
  labs(x = "Mean CATE (train-based T-learner, bin)",
       y = "Mean CATE (test-based T-learner, bin)",
       title = "Calibration of CATE on test data (indirect)")

Heterogeneity and policy recommendation

Train-side heterogeneity by CATE quartiles:

# Quartiles of train-based CATE
train_q3$tau_T <- tau_T_train
train_q3$cate_group <- cut(
  train_q3$tau_T,
  breaks = quantile(train_q3$tau_T, probs = c(0, 0.25, 0.5, 0.75, 1),
                    na.rm = TRUE),
  include.lowest = TRUE,
  labels = c("Q1 (lowest)", "Q2", "Q3", "Q4 (highest)")
)

# Summary by CATE quartile (training data)
hetero_tab <- train_q3 %>%
  dplyr::group_by(cate_group) %>%
  dplyr::summarise(
    mean_tau   = mean(tau_T, na.rm = TRUE),
    mean_book  = mean(booking_bool),
    mean_prom  = mean(promotion_flag),
    mean_price = mean(prop_log_historical_price),
    mean_star  = mean(prop_starrating),
    mean_loc1  = mean(prop_location_score1),
    mean_loc2  = mean(prop_location_score2),
    n          = dplyr::n(),
    .groups    = "drop"
  )

hetero_tab

Validation of heterogeneity patterns on test data using the test-side “oracle” CATE:

# Map test observations to CATE groups via train-based CATE on test X
test_q3$tau_T_hat <- tau_T_test_from_train
test_q3$cate_group <- cut(
  test_q3$tau_T_hat,
  breaks = quantile(test_q3$tau_T_hat, probs = c(0, 0.25, 0.5, 0.75, 1),
                    na.rm = TRUE),
  include.lowest = TRUE,
  labels = c("Q1 (lowest)", "Q2", "Q3", "Q4 (highest)")
)

hetero_test_tab <- test_q3 %>%
  dplyr::mutate(tau_oracle = tau_T_test_oracle) %>%
  dplyr::group_by(cate_group) %>%
  dplyr::summarise(
    mean_tau_oracle = mean(tau_oracle, na.rm = TRUE),
    mean_book       = mean(booking_bool),
    mean_prom       = mean(promotion_flag),
    n               = dplyr::n(),
    .groups         = "drop"
  )

hetero_test_tab
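
A simple individualized rule follows from these estimates: recommend the promotion whenever the predicted CATE is positive, and assess that rule on the test set with an inverse-propensity-weighted value estimate. The sketch below reuses tau_T_test_from_train, ps_test_q3, A_test, and Y_test from the chunks above; the zero threshold and the 0.02/0.98 clipping are arbitrary choices, and the value estimate is only as trustworthy as the logistic propensity model.

# Sketch: threshold policy from the train-based T-learner CATE,
# evaluated on the test set via inverse-propensity weighting
d_hat <- as.integer(tau_T_test_from_train > 0)           # recommend promotion if predicted CATE > 0

p_rec <- ifelse(d_hat == 1, ps_test_q3, 1 - ps_test_q3)  # P(observed action == recommended action | X)
p_rec <- pmin(pmax(p_rec, 0.02), 0.98)                   # clip to stabilize the weights

follow   <- as.integer(A_test == d_hat)                  # rows where the logged action matches the policy
V_policy <- mean(follow * Y_test / p_rec)                # IPW estimate of booking rate under the policy
V_logged <- mean(Y_test)                                 # booking rate under the logged policy

cat("Share of test searches recommended for promotion:", round(mean(d_hat), 3), "\n")
cat("IPW value of the CATE > 0 policy:", round(V_policy, 4),
    " vs observed booking rate:", round(V_logged, 4), "\n")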