So far we have discussed estimating the average treatment effect (ATE). However, knowing the ATE only provides a population statement of which treatment is better on average. In practice, we are often interested in knowing whether a treatment is better for a given individual.1 Precisely, we are interested in possibly two quantities, the conditional average treatment effect (CATE):
\[ \tau(x) = \E[Y(1) - Y(0) | X = x] \]
and the optimal treatment regime:
\[ \begin{aligned} \pi_\text{opt}(x) &= \underset{\pi}{\arg\max} \,\, \E\left[Y(\pi(X)) | X = x\right] \\ &= \mathbf{1}\{\tau(x) > 0\} \end{aligned} \]
where \(\pi\) is a treatment assignment function that takes the patient’s covariate \(X\) as the input. The best of all such functions, \(\pi_\text{opt}(x)\), which makes the best decision for all individuals, is the called the optimal treatment regime. In this section, we will discuss the estimation of the optimal treatment regime. Sometimes they involve estimating the CATE \(\tau(x)\), sometimes they directly estimate the decision rule. Those who estimate the CATE and infer the decision rule are called indirect learning, while those who directly estimate the decision rule are called direct learning.
The most straight forward idea is an indirect learning approach. Here, we would always assume that there is no unmeasured confounding. Hence, the approach involves two steps:
A typical example of this type of approach is called the Virtual Twins model (Foster, Taylor, and Ruberg 2011). It uses random forests to learn the regression functions in the first step, and in the second step, it uses a single tree model to summarize the decision rule and make it interpretable. This example using sepsis data is mainly based on the vignettes of the aVirtualTwins
R package. We also need the packages for random forests and CART.
Let’s consider a simulated data set derived from the SIDES package. The data in .csv
format can be downloaded from our course website. In this data set, 470 patients and 14 variables are collected. The variables are listed below.
Health
: Health outcome (larger the better)THERAPY
: 1 for active treatment, 0 for the control treatmentTIMFIRST
: Time from first sepsis-organ fail to start drugAGE
: Patient age in yearsBLLPLAT
: Baseline local plateletsblSOFA
: Sum of baseline sofa score (cardiovascular, hematology, hepatorenal, and respiration scores)BLLCREAT
: Base creatinineORGANNUM
: Number of baseline organ failuresPRAPACHE
: Pre-infusion apache-ii scoreBLGCS
: Base GLASGOW coma scale scoreBLIL6
: Baseline serum IL-6 concentrationBLADL
: Baseline activity of daily living scoreBLLBILI
: Baseline local bilirubinBEST
: The true best treatment suggested by doctors. You should not use this variable when fitting models!For each patient, sepsis was observed during their hospital stay. Hence, one of the two treatments (indicated by variable THERAPY
) must be chosen to prevent further adverse events. After the treatment, the patient’s health outcome (Health
) was measured, with a larger value being the better outcome. The BEST
variable is a doctor suggested best treatment, which is not observed. This can be regarded as the unknown truth.
Run the cell below to load the data set and display the first few rows. It will only take a few seconds to complete. Make sure that you have the working directory setup correctly. You can do this by putting your .rmd
file and the .csv
file in the same folder, then open your .rmd
file to pop up RStudio.
Sepsis <- read.csv("..//dataset//Sepsis.csv")
# remove the first column, which is observation ID
Sepsis = Sepsis[, -1]
head(Sepsis)
We will fit two random forests to model the outcome Health
: one model is based on all patients who received treatment (THERAPY
) 1, and the other model is based on all patients who received treatment 0 (corresponding to our previous label \(-1\)). Denote these two models as \(\hat f_1(x)\) and \(\hat f_0(x)\), respectively. Here, we will also use a little trick in random forest, which is the out-of-bag prediction. This is similar to a leave-one-out type of approach and the fitted value of a subject is calculated from the trees that do not use this subject. For more details, you may refer to this lecture note. This mechanic effectively prevents over-fitting to certain extend. In the meantime, we should also tune both mtry
and nodesize
, but that step will be skipped in the demonstration. The following code demonstates this idea:
library(randomForest)
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
# fit model for treatment 0
model0 <- randomForest(Health ~ . - BEST,
data = Sepsis[Sepsis$THERAPY == 0, ],
ntree=2000,
mtry = 5,
nodesize = 5)
# fit model for treatment 1
model1 <- randomForest(Health ~ . - BEST,
data = Sepsis[Sepsis$THERAPY == 1, ],
ntree=2000,
mtry = 5,
nodesize = 5)
# out-of-bag prediction for treatment 0 group with model 0
model0.treat0 = model0$predicted
# out-of-bag prediction for treatment 1 group with model 1
model1.treat1 = model1$predicted
# prediction for treatment 1 group with model 0
model0.treat1 = predict(model0, Sepsis[Sepsis$THERAPY == 1, ])
# prediction for treatment 0 group with model 1
model1.treat0 = predict(model1, Sepsis[Sepsis$THERAPY == 0, ])
# combine predictions together
pred.treat0 = rep(NA, nrow(Sepsis))
pred.treat0[Sepsis$THERAPY == 0] = model0.treat0
pred.treat0[Sepsis$THERAPY == 1] = model0.treat1
pred.treat1 = rep(NA, nrow(Sepsis))
pred.treat1[Sepsis$THERAPY == 0] = model1.treat0
pred.treat1[Sepsis$THERAPY == 1] = model1.treat1
# which is better?
pred.best = (pred.treat1 > pred.treat0)
# is this good?
mean(pred.best == Sepsis$BEST)
## [1] 0.812766
So it looks like we got pretty good accuracy, 81%. This is simply taking the optimal decision rule as the one that has a higher predicted outcome based on the twin random forest models. But random forest is a very complex model, and is not interpretable. Hence a doctor cannot easily know what is the better treatment.
This second step can be optional, but very useful in practice if simple decision rule is prefered. We will construct a single-tree model (CART) to represent the optimal decision rule \(\pi_\text{opt}(x)\). The idea is very simple, we will create an artificial label, best.label
, which is 1 if pred.treat1
is larger than pred.treat0
, and 0 otherwise. Then we will use all of our covariates to model the best.label
we learned.
library(rpart)
# fit a decision tree to predict pred.best,
# excluding BEST, Health, and THERAPY from the dataset
best.label = ifelse(pred.treat1 > pred.treat0, "Treatment 1", "Treatment 0")
best.label = as.factor(best.label)
rpart.fit <- rpart(best.label ~ . - BEST - Health - THERAPY, data = Sepsis)
# in the coming cells, we will prune the tree
# start by plotting the cross-validation relative error at different tree sizes
plotcp(rpart.fit)
After some tuning of the tree model, using the cp
parameter, we can view the fitted tree model:
# cut the tree
cutted.tree = prune(rpart.fit, cp = 0.04)
library(rpart.plot)
# make a good plot
rpart.plot(cutted.tree)
Hence, the conclusion is that, when \(\text{Age} < 52\), we should use treatment 1. While for \(\text{Age} \geq 52\), if the apache score is less than 32, we should use treatment 0 and otherwise treatment 1.
We can of course fit a linear regression model with interactions between the treatment label and the covariates. If we simply do that and use the fitted model to predict the best treatment, we are doing indirect learning.
\[ \E[Y | X, A] = \boldsymbol \beta^T X + \boldsymbol \alpha^T X \cdot A \]
The following code demonstrate this idea using a simulated data set and the Lasso model. Please note that we will be using \(A \in \{-1, 1\}\) as the treatment label since it will benifit our later development.
set.seed(1)
n = 100
p = 200
X = matrix(rnorm(n*p), n, p)
A = rbinom(n, 1, 0.5)*2-1
beta = c(rep(0.2, 10), rep(0, p-10))
alpha = c(rep(0, p-2), 0.75, -0.75)
R = X %*% beta + A * (X %*% alpha) + rnorm(n, sd = 0.5)
testn = 500
testX = matrix(rnorm(testn*p), testn, p)
TRUE_BEST = sign(testX %*% alpha)
library(glmnet)
## Loading required package: Matrix
## Loaded glmnet 4.1-8
# fit two regression models
Apos.fit <- cv.glmnet(x = cbind(X, A)[A == 1,], y = R[A == 1])
Aneg.fit <- cv.glmnet(x = cbind(X, A)[A == -1,], y = R[A == -1])
# compare the models for new data
Apos.pred = predict(Apos.fit, newx = cbind(testX, 1))
Aneg.pred = predict(Aneg.fit, newx = cbind(testX, -1))
# the predicted treatment label
indirect.pred = sign(Apos.pred - Aneg.pred)
# the accuracy
mean(indirect.pred == TRUE_BEST)
## [1] 0.876
This seems to be fairly accurate. We got 88% of accuracy, meaning that, for around 88% of the new patients, we suggested the better treatment.
But let’s also explore an interesting fact about our previous equation. This method was proposed by Tian et al. (2014). Suppose we do not use \(Y\) as our outcome of interest. Instead, we use \(AY\), the product of treatment label and the outcome Then, we have
\[ \begin{aligned} \E(AY| X) =& \E( A | X) \cdot \boldsymbol \beta^T X + \E(A^2 | X) \cdot \boldsymbol \alpha^T X.\\ =& \big[ P( A = 1 | X) - P(A = -1 | X) \big] \cdot \boldsymbol \beta^T X + \boldsymbol \alpha^T X \end{aligned} \]
If our study is collected from a balenced randomized trial, i.e., \(\Pr(A = 1 | X) = P(A = -1 | X) = 0.5\), then the first term is zero. And we only have the second term
\[ \E(AY | X) = \boldsymbol \alpha^T X \]
This suggests a very simple idea for directly modeling the best treatment rule using this modified outcome:
# fitting the direct learning model
direct.fit <- cv.glmnet(x = X, y = R*A)
# direct learning predicted label
direct.pred <- sign(predict(direct.fit, testX))
# the accuracy
mean(direct.pred == TRUE_BEST)
## [1] 0.916
This time, the accuracy 92% is slightly better than the previous one.
As we can expect, the way to deal with observational study is to use the propensity score weighting. We can modify the previous equation using a inverse propensity weight. Of course we would also assume suitable conditions as we did in previous sections.
\[ \begin{aligned} & \E \left( \frac{A}{\Pr( A | X)}Y \Biggm| X \right) \\ =& \left[ \frac{1}{\Pr( A | X)}\Pr( A = 1 | X) + \frac{-1}{\Pr( A = -1 | X)} \Pr( A = -1 | X) \right] \cdot \bbeta^T X + \\ &\quad \left[ \frac{1^2}{\Pr( A = 1| X)}\Pr( A = 1| X) + \frac{(-1)^2}{\Pr( A = -1| X)} \Pr( A = -1| X) \right] \balpha^T X \\ =& \left[ 1 - 1 \right] \cdot \bbeta^T X + \left[ 1 - 1 \right] \balpha^T X \\ =& 2 \balpha^T X \end{aligned} \]
So the first term can still be canceled out. Then, our job simply involves one more step: estimate the propensity scores \(\Pr( A | X)\), using, e.g. logistic regression or random forests, and then plug into the previous linear regression estimation using the subject weights \(\widehat{\Pr}(A = a_i | X = x_i)\).
We sometimes also interested in the average treatment effect of a specific subgroup of individuals: \(E[ Y(1) - Y(0) | G = 1]\) where \(G\) is some indicator of a specific characteristics. This concept is called Local Average Treatment Effect (LATE)↩︎