Overview

So far we have discussed estimating the average treatment effect (ATE). However, knowing the ATE only provides a population statement of which treatment is better on average. In practice, we are often interested in knowing whether a treatment is better for a given individual.¹ Precisely, we are interested in possibly two quantities, the conditional average treatment effect (CATE):

\[ \tau(x) = \E[Y(1) - Y(0) | X = x] \]

and the optimal treatment regime:

\[ \begin{aligned} \pi_\text{opt}(x) &= \underset{\pi}{\arg\max} \,\, \E\left[Y(\pi(X)) | X = x\right] \\ &= \mathbf{1}\{\tau(x) > 0\} \end{aligned} \]

where \(\pi\) is a treatment assignment function that takes the patient’s covariate \(X\) as the input. The best of all such functions, \(\pi_\text{opt}(x)\), which makes the best decision for all individuals, is the called the optimal treatment regime. In this section, we will discuss the estimation of the optimal treatment regime. Sometimes they involve estimating the CATE \(\tau(x)\), sometimes they directly estimate the decision rule. Those who estimate the CATE and infer the decision rule are called indirect learning, while those who directly estimate the decision rule are called direct learning.

Regression Approach: The Virtual Twins

The most straight forward idea is an indirect learning approach. Here, we would always assume that there is no unmeasured confounding. Hence, the approach involves two steps:

In the first step, we learn the regression functions \(\E[Y | X, A = 1]\) and \(\E[Y | X, A = 0]\).
In the second step, we compare the two regression functions to decide which treatment is better.

A typical example of this type of approach is called the Virtual Twins model (Foster, Taylor, and Ruberg 2011). It uses random forests to learn the regression functions in the first step, and in the second step, it uses a single tree model to summarize the decision rule and make it interpretable. This example using sepsis data is mainly based on the vignettes of the aVirtualTwins R package. We also need the packages for random forests and CART.

Let’s consider a simulated data set derived from the SIDES package. The data in .csv format can be downloaded from our course website. In this data set, 470 patients and 14 variables are collected. The variables are listed below.

Health: Health outcome (larger the better)
THERAPY: 1 for active treatment, 0 for the control treatment
TIMFIRST: Time from first sepsis-organ fail to start drug
AGE: Patient age in years
BLLPLAT: Baseline local platelets
blSOFA: Sum of baseline sofa score (cardiovascular, hematology, hepatorenal, and respiration scores)
BLLCREAT: Base creatinine
ORGANNUM: Number of baseline organ failures
PRAPACHE: Pre-infusion apache-ii score
BLGCS: Base GLASGOW coma scale score
BLIL6: Baseline serum IL-6 concentration
BLADL: Baseline activity of daily living score
BLLBILI: Baseline local bilirubin
BEST: The true best treatment suggested by doctors. You should not use this variable when fitting models!

For each patient, sepsis was observed during their hospital stay. Hence, one of the two treatments (indicated by variable THERAPY) must be chosen to prevent further adverse events. After the treatment, the patient’s health outcome (Health) was measured, with a larger value being the better outcome. The BEST variable is a doctor suggested best treatment, which is not observed. This can be regarded as the unknown truth.

Run the cell below to load the data set and display the first few rows. It will only take a few seconds to complete. Make sure that you have the working directory setup correctly. You can do this by putting your .rmd file and the .csv file in the same folder, then open your .rmd file to pop up RStudio.

  Sepsis <- read.csv("..//dataset//Sepsis.csv")
  # remove the first column, which is observation ID
  Sepsis = Sepsis[, -1]
  head(Sepsis)

Step 1: Outcome Regressions

We will fit two random forests to model the outcome Health: one model is based on all patients who received treatment (THERAPY) 1, and the other model is based on all patients who received treatment 0 (corresponding to our previous label \(-1\)). Denote these two models as \(\hat f_1(x)\) and \(\hat f_0(x)\), respectively. Here, we will also use a little trick in random forest, which is the out-of-bag prediction. This is similar to a leave-one-out type of approach and the fitted value of a subject is calculated from the trees that do not use this subject. For more details, you may refer to this lecture note. This mechanic effectively prevents over-fitting to certain extend. In the meantime, we should also tune both mtry and nodesize, but that step will be skipped in the demonstration. The following code demonstates this idea:

  library(randomForest)
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.

  # fit model for treatment 0
  model0 <- randomForest(Health ~ . - BEST, 
                         data = Sepsis[Sepsis$THERAPY == 0, ], 
                         ntree=2000,
                         mtry = 5, 
                         nodesize = 5)
  
  # fit model for treatment 1
  model1 <- randomForest(Health ~ . - BEST, 
                         data = Sepsis[Sepsis$THERAPY == 1, ], 
                         ntree=2000,
                         mtry = 5, 
                         nodesize = 5)
  
  # out-of-bag prediction for treatment 0 group with model 0
  model0.treat0 = model0$predicted
  
  # out-of-bag prediction for treatment 1 group with model 1
  model1.treat1 = model1$predicted
  
  # prediction for treatment 1 group with model 0
  model0.treat1 = predict(model0, Sepsis[Sepsis$THERAPY == 1, ])
  
  # prediction for treatment 0 group with model 1
  model1.treat0 = predict(model1, Sepsis[Sepsis$THERAPY == 0, ])

  # combine predictions together 
  pred.treat0 = rep(NA, nrow(Sepsis))
  pred.treat0[Sepsis$THERAPY == 0] = model0.treat0
  pred.treat0[Sepsis$THERAPY == 1] = model0.treat1
  
  pred.treat1 = rep(NA, nrow(Sepsis))
  pred.treat1[Sepsis$THERAPY == 0] = model1.treat0
  pred.treat1[Sepsis$THERAPY == 1] = model1.treat1  
  
  # which is better?
  pred.best = (pred.treat1 > pred.treat0)
  
  # is this good?
  mean(pred.best == Sepsis$BEST)
## [1] 0.812766

So it looks like we got pretty good accuracy, 81%. This is simply taking the optimal decision rule as the one that has a higher predicted outcome based on the twin random forest models. But random forest is a very complex model, and is not interpretable. Hence a doctor cannot easily know what is the better treatment.

Step 2: A Tree Decision Rule

This second step can be optional, but very useful in practice if simple decision rule is prefered. We will construct a single-tree model (CART) to represent the optimal decision rule \(\pi_\text{opt}(x)\). The idea is very simple, we will create an artificial label, best.label, which is 1 if pred.treat1 is larger than pred.treat0, and 0 otherwise. Then we will use all of our covariates to model the best.label we learned.

  library(rpart)

  # fit a decision tree to predict pred.best, 
  # excluding BEST, Health, and THERAPY from the dataset
  best.label = ifelse(pred.treat1 > pred.treat0, "Treatment 1", "Treatment 0")
  best.label = as.factor(best.label)
  
  rpart.fit <- rpart(best.label ~ . - BEST - Health - THERAPY, data = Sepsis)
  
  # in the coming cells, we will prune the tree
  # start by plotting the cross-validation relative error at different tree sizes
  plotcp(rpart.fit)

After some tuning of the tree model, using the cp parameter, we can view the fitted tree model:

  # cut the tree
  cutted.tree = prune(rpart.fit, cp = 0.04)

  library(rpart.plot)
  # make a good plot
  rpart.plot(cutted.tree)

Hence, the conclusion is that, when \(\text{Age} < 52\), we should use treatment 1. While for \(\text{Age} \geq 52\), if the apache score is less than 32, we should use treatment 0 and otherwise treatment 1.

Direct Learning with Linear Regression

We can of course fit a linear regression model with interactions between the treatment label and the covariates. If we simply do that and use the fitted model to predict the best treatment, we are doing indirect learning.

\[ \E[Y | X, A] = \boldsymbol \beta^T X + \boldsymbol \alpha^T X \cdot A \]

The following code demonstrate this idea using a simulated data set and the Lasso model. Please note that we will be using \(A \in \{-1, 1\}\) as the treatment label since it will benifit our later development.

  set.seed(1)
  n = 100
  p = 200
  X = matrix(rnorm(n*p), n, p)
  A = rbinom(n, 1, 0.5)*2-1
  beta = c(rep(0.2, 10), rep(0, p-10))
  alpha = c(rep(0, p-2), 0.75, -0.75)
  R = X %*% beta + A * (X %*% alpha) + rnorm(n, sd = 0.5)
  
  testn = 500
  testX = matrix(rnorm(testn*p), testn, p)
  TRUE_BEST = sign(testX %*% alpha)

  library(glmnet)
## Loading required package: Matrix
## Loaded glmnet 4.1-8

  # fit two regression models
  Apos.fit <- cv.glmnet(x = cbind(X, A)[A == 1,], y = R[A == 1])
  Aneg.fit <- cv.glmnet(x = cbind(X, A)[A == -1,], y = R[A == -1])
  
  # compare the models for new data
  Apos.pred = predict(Apos.fit, newx = cbind(testX, 1))
  Aneg.pred = predict(Aneg.fit, newx = cbind(testX, -1))
  
  # the predicted treatment label
  indirect.pred = sign(Apos.pred - Aneg.pred)
  
  # the accuracy
  mean(indirect.pred == TRUE_BEST)
## [1] 0.876

This seems to be fairly accurate. We got 88% of accuracy, meaning that, for around 88% of the new patients, we suggested the better treatment.

Case 1: Randomized Trial

But let’s also explore an interesting fact about our previous equation. This method was proposed by Tian et al. (2014). Suppose we do not use \(Y\) as our outcome of interest. Instead, we use \(AY\), the product of treatment label and the outcome Then, we have

\[ \begin{aligned} \E(AY| X) =& \E( A | X) \cdot \boldsymbol \beta^T X + \E(A^2 | X) \cdot \boldsymbol \alpha^T X.\\ =& \big[ P( A = 1 | X) - P(A = -1 | X) \big] \cdot \boldsymbol \beta^T X + \boldsymbol \alpha^T X \end{aligned} \]

If our study is collected from a balenced randomized trial, i.e., \(\Pr(A = 1 | X) = P(A = -1 | X) = 0.5\), then the first term is zero. And we only have the second term

\[ \E(AY | X) = \boldsymbol \alpha^T X \]

This suggests a very simple idea for directly modeling the best treatment rule using this modified outcome:

  # fitting the direct learning model
  direct.fit <- cv.glmnet(x = X, y = R*A)

  # direct learning predicted label
  direct.pred <- sign(predict(direct.fit, testX))
  
  # the accuracy
  mean(direct.pred == TRUE_BEST)  
## [1] 0.916

This time, the accuracy 92% is slightly better than the previous one.

Case 2: Observational Study

As we can expect, the way to deal with observational study is to use the propensity score weighting. We can modify the previous equation using a inverse propensity weight. Of course we would also assume suitable conditions as we did in previous sections.

\[ \begin{aligned} & \E \left( \frac{A}{\Pr( A | X)}Y \Biggm| X \right) \\ =& \left[ \frac{1}{\Pr( A | X)}\Pr( A = 1 | X) + \frac{-1}{\Pr( A = -1 | X)} \Pr( A = -1 | X) \right] \cdot \bbeta^T X + \\ &\quad \left[ \frac{1^2}{\Pr( A = 1| X)}\Pr( A = 1| X) + \frac{(-1)^2}{\Pr( A = -1| X)} \Pr( A = -1| X) \right] \balpha^T X \\ =& \left[ 1 - 1 \right] \cdot \bbeta^T X + \left[ 1 - 1 \right] \balpha^T X \\ =& 2 \balpha^T X \end{aligned} \]

So the first term can still be canceled out. Then, our job simply involves one more step: estimate the propensity scores \(\Pr( A | X)\), using, e.g. logistic regression or random forests, and then plug into the previous linear regression estimation using the subject weights \(\widehat{\Pr}(A = a_i | X = x_i)\).

Reference

Foster, Jared C, Jeremy MG Taylor, and Stephen J Ruberg. 2011. “Subgroup Identification from Randomized Clinical Trial Data.” Statistics in Medicine 30 (24): 2867–80.

Tian, Lu, Ash A Alizadeh, Andrew J Gentles, and Robert Tibshirani. 2014. “A Simple Method for Estimating Interactions Between a Treatment and a Large Number of Covariates.” Journal of the American Statistical Association 109 (508): 1517–32.

We sometimes also interested in the average treatment effect of a specific subgroup of individuals: \(E[ Y(1) - Y(0) | G = 1]\) where \(G\) is some indicator of a specific characteristics. This concept is called Local Average Treatment Effect (LATE)↩︎

Optimal Treatment Regime: Direct vs. Indirect Learning

Ruoqing Zhu

Last Updated: March 26, 2024