Precision medicine is a rapidly growing field at the intersection of statistics and machine learning. It takes individual heterogeneity, environment, and lifestyle into consideration when making decisions to improve human health. Recent advances in large-scale biomedical databases and computational devices have dramatically improved the prospect of optimizing an individual’s treatment, and the launch of the Precision Medicine Initiative has sparked much more research in discovering individualized treatments. In this tutorial, we will introduce three methods for personalized medicine: linear regression that models interactions, the Virtual Twins model, and outcome weighted learning.
Different from a regular regression or classification problem, a personalized medicine problem involves one more type of variable: the treatment label, denoted as \(A\) (action). In the simplest case, we may assume that \(A \in \{-1, 1\}\), a binary label; however, the action can also be multi-category or continuous. The outcome we observe is denoted as \(R\) (reward). These notations are borrowed from the reinforcement learning literature. In a traditional medical problem, we often assume that the treatment effect (e.g., of using a drug) is the same for all subjects. In a certain sense, we may think about the following linear model:
\[E(R | X, A) = \boldsymbol \beta^T X + \alpha A.\] In this case, those who take treatment \(A = 1\) (e.g., taking a drug) will have \(2 \times \alpha\) higher mean reward than those who take treatment \(A = -1\) (e.g., taking a placebo). Hence, if \(\alpha > 0\), everyone should use treatment \(1\); otherwise, everyone should use treatment \(-1\). However, as medical science evolves, we know that this “one-size-fits-all” concept is not true in practice, and we should explore heterogeneous treatment effects. If we still restrict ourselves to a linear model framework, we could have
\[E(R | X, A) = \boldsymbol \beta^T X + A \cdot \boldsymbol \alpha^T X.\]
Here, the notation \(A \cdot X\) means that we are using the interaction between \(A\) and each component of \(X\). Note that \(\boldsymbol \alpha\) also becomes a vector to incorporate all information in \(X\).
Our question in personalized medicine is: given a patient’s covariate information \(X\), which treatment label \(A\) provides the better reward? Formally, we want to compare
\[ E(R | X, A = 1) \quad \text{vs.} \quad E(R | X, A = -1)\]
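Under the interaction model above, this comparison has a simple form, since
\[E(R | X, A = 1) - E(R | X, A = -1) = (\boldsymbol \beta^T X + \boldsymbol \alpha^T X) - (\boldsymbol \beta^T X - \boldsymbol \alpha^T X) = 2 \, \boldsymbol \alpha^T X.\]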
Hence, a simple strategy is to estimate the regression function and compare the two expected rewards. However, we should also notice that, as long as \(\boldsymbol \alpha^T X\) is positive, \(A = 1\) is more beneficial. Hence, this leads to two main strategies: an indirect approach, which estimates the outcome regression under each treatment and recommends the treatment with the higher predicted reward, and a direct approach, which models the decision rule \(\text{sign}(\boldsymbol \alpha^T X)\) itself.
We will explore these two strategies with a simulation study and use Lasso to perform the analysis. In this problem, 10 variables are involved in \(\boldsymbol \beta\), while two variables are involved in the decision rule parameter \(\boldsymbol \alpha\).
set.seed(1)
n = 100
p = 200
X = matrix(rnorm(n*p), n, p)
# randomized binary treatment, coded as -1 / 1
A = rbinom(n, 1, 0.5)*2-1
# main effects involve the first 10 variables
beta = c(rep(0.2, 10), rep(0, p-10))
# the decision rule involves only the last two variables
alpha = c(rep(0, p-2), 0.75, -0.75)
R = X %*% beta + A * (X %*% alpha) + rnorm(n, sd = 0.5)
# test data and the true best treatment labels
testn = 500
testX = matrix(rnorm(testn*p), testn, p)
TRUE_BEST = sign(testX %*% alpha)
The idea of indirect learning is very simple. Let’s just fit a separate Lasso model within each treatment arm and compare their predictions:
library(glmnet)
## Loading required package: Matrix
## Loaded glmnet 4.1-8
# fit two regression models
Apos.fit <- cv.glmnet(x = cbind(X, A)[A == 1,], y = R[A == 1])
Aneg.fit <- cv.glmnet(x = cbind(X, A)[A == -1,], y = R[A == -1])
# compare the models for new data
Apos.pred = predict(Apos.fit, newx = cbind(testX, 1))
Aneg.pred = predict(Aneg.fit, newx = cbind(testX, -1))
# the predicted treatment label
indirect.pred = sign(Apos.pred - Aneg.pred)
# the accuracy
mean(indirect.pred == TRUE_BEST)
## [1] 0.876
This seems to be fairly accurate. We achieved about 88% accuracy, meaning that for around 88% of the new patients, we suggested the better treatment.
Let’s explore an interesting fact about our previous equation, which leads to the direct learning method proposed by Tian et al. (2014). Suppose we do not use \(R\) as our outcome of interest. Instead, we use \(AR\), the product of the treatment label and the reward. Then, we have
\[\begin{align} E(AR| X) =& E( A | X) \cdot \boldsymbol \beta^T X + E(A^2 | X) \cdot \boldsymbol \alpha^T X\\ =& \big[ P( A = 1 | X) - P(A = -1 | X) \big] \cdot \boldsymbol \beta^T X + \boldsymbol \alpha^T X \end{align}\]
If our study comes from a randomized trial, meaning that in the observed data the treatment assignment \(A\) does not depend on the covariate information and is balanced, i.e., \(P(A = 1 | X) = P(A = -1 | X) = 0.5\), then the first term is zero and only the second term remains:
\[E(AR | X) = \boldsymbol \alpha^T X\] This suggests a very simple idea: directly model the best treatment rule by regressing \(AR\) on \(X\).
# fitting the direct learning model
direct.fit <- cv.glmnet(x = X, y = R*A)
# direct learning predicted label
direct.pred <- sign(predict(direct.fit, testX))
# the accuracy
mean(direct.pred == TRUE_BEST)
## [1] 0.916
This time, the accuracy (92%) is slightly better than the previous one.
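Since this is a simulation and the true model is known, we can also compare the two rules on the scale we ultimately care about: the expected reward on the test set. The sketch below (not part of the original analysis) uses the true \(\boldsymbol \beta\) and \(\boldsymbol \alpha\) to compute the mean reward each rule would achieve.
# expected reward of a rule d on the test set under the true simulation model
rule.value <- function(d) mean(testX %*% beta + d * (testX %*% alpha))
rule.value(indirect.pred)   # value of the indirect learning rule
rule.value(direct.pred)     # value of the direct learning rule
rule.value(TRUE_BEST)       # value of the optimal rule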
An important remark is that the randomized trial assumption is crucial. Without this assumption, the first term cannot be removed, which is frequently the case in observational studies. However, we can still modify the previous equation using inverse propensity weighting,
\[\begin{align} &E\left(\frac{A}{P(A | X)}R \mid X\right) \\ =& \Big[ \frac{1}{P(A = 1 | X)}P( A = 1 | X) + \frac{-1}{P(A = -1 | X)} P(A = -1 | X) \Big] \cdot \boldsymbol \beta^T X + \\ &\quad \Big[ \frac{1^2}{P(A = 1 | X)}P(A = 1 | X) + \frac{(-1)^2}{P(A = -1 | X)} P(A = -1 | X) \Big] \boldsymbol \alpha^T X \\ =& \big[ 1 - 1 \big] \cdot \boldsymbol \beta^T X + \big[ 1 + 1 \big] \boldsymbol \alpha^T X \\ =& \, 2 \boldsymbol \alpha^T X \end{align}\]
So the first term can still be canceled out. Then, our job simply involves one more step: estimate the propensity scores \(P(A | X)\) using, e.g., a logistic regression, and then plug them into the previous linear regression estimation as subject weights \(1 / \widehat{P}(A = a_i | X = x_i)\). This weighting idea is frequently used in the literature; we shall see it again in the outcome weighted learning section.
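Below is a minimal sketch of this idea on our simulated data (which is in fact randomized, so the estimated weights are roughly constant and this is purely illustrative): estimate the propensity score with a lasso-penalized logistic regression, then refit the direct learning model with weights \(1/\widehat{P}(A_i | X_i)\).
# estimate the propensity scores P(A = 1 | X) with a penalized logistic regression
ps.fit <- cv.glmnet(x = X, y = (A + 1)/2, family = "binomial")
ps <- predict(ps.fit, newx = X, type = "response")
# inverse propensity weights 1 / P-hat(A_i | X_i)
ipw <- ifelse(A == 1, 1/ps, 1/(1 - ps))
# weighted direct learning
ipw.fit <- cv.glmnet(x = X, y = R*A, weights = as.numeric(ipw))
ipw.pred <- sign(predict(ipw.fit, testX))
mean(ipw.pred == TRUE_BEST)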
When the observed treatment assignment mechanism is unknown, an indirect approach can still have advantages. A popular approach, the Virtual Twins model, was proposed by Foster et al. (2011). This is an indirect learning approach: the key idea is to use random forests to learn the regression models, and then use a single tree model to summarize the decision and make it interpretable. This example using sepsis data is mainly based on the vignettes of the aVirtualTwins R package. We also need the packages for random forests and CART.
Let’s consider a simulated data set derived from the SIDES package. The data in .csv format can be downloaded from our course website. In this data set, 470 patients and 14 variables are collected. The variables are listed below.
- Health: Health outcome (larger the better)
- THERAPY: 1 for active treatment, 0 for the control treatment
- TIMFIRST: Time from first sepsis-organ fail to start drug
- AGE: Patient age in years
- BLLPLAT: Baseline local platelets
- blSOFA: Sum of baseline sofa score (cardiovascular, hematology, hepatorenal, and respiration scores)
- BLLCREAT: Base creatinine
- ORGANNUM: Number of baseline organ failures
- PRAPACHE: Pre-infusion apache-ii score
- BLGCS: Base GLASGOW coma scale score
- BLIL6: Baseline serum IL-6 concentration
- BLADL: Baseline activity of daily living score
- BLLBILI: Baseline local bilirubin
- BEST: The true best treatment suggested by doctors. You should not use this variable when fitting models!

For each patient, sepsis was observed during their hospital stay. Hence, one of the two treatments (indicated by the variable THERAPY) must be chosen to prevent further adverse events. After the treatment, the patient’s health outcome (Health) was measured, with a larger value being the better outcome. The BEST variable is a doctor-suggested best treatment, which is not observed in practice. This can be regarded as the unknown truth.
Run the code below to load the data set and display the first few rows. It will only take a few seconds to complete. Make sure that your working directory is set up correctly. You can do this by putting your .rmd file and the .csv file in the same folder, then opening the .rmd file to launch RStudio.
Sepsis <- read.csv("Sepsis.csv")
# remove the first column, which is observation ID
Sepsis = Sepsis[, -1]
head(Sepsis)
We will fit two random forests to model the outcome Health: one model based on all patients who received treatment 1 (THERAPY = 1), and the other based on all patients who received treatment 0 (corresponding to our previous label \(-1\)). Denote these two models as \(\widehat{f}_1(x)\) and \(\widehat f_0(x)\), respectively. Compared with our previous linear regression example, there is one more step: we need to produce predictions for all training subjects from both models, because these training-data predictions will be used to learn an interpretable tree model in Step 2. The key mechanism we use here is the out-of-bag prediction from random forests, which prevents over-fitting on in-sample observations. In the meantime, we should also tune both mtry and nodesize, but that step is skipped in this demonstration. Read the following code.
library(randomForest)
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
# fit model for treatment 0
model0 <- randomForest(Health ~ . - BEST,
data = Sepsis[Sepsis$THERAPY == 0, ],
ntree=2000,
mtry = 5,
nodesize = 5)
# fit model for treatment 1
model1 <- randomForest(Health ~ . - BEST,
data = Sepsis[Sepsis$THERAPY == 1, ],
ntree=2000,
mtry = 5,
nodesize = 5)
# out-of-bag prediction for treatment 0 group with model 0
model0.treat0 = model0$predicted
# out-of-bag prediction for treatment 1 group with model 1
model1.treat1 = model1$predicted
# prediction for treatment 1 group with model 0
model0.treat1 = predict(model0, Sepsis[Sepsis$THERAPY == 1, ])
# prediction for treatment 0 group with model 1
model1.treat0 = predict(model1, Sepsis[Sepsis$THERAPY == 0, ])
# combine predictions together
pred.treat0 = rep(NA, nrow(Sepsis))
pred.treat0[Sepsis$THERAPY == 0] = model0.treat0
pred.treat0[Sepsis$THERAPY == 1] = model0.treat1
pred.treat1 = rep(NA, nrow(Sepsis))
pred.treat1[Sepsis$THERAPY == 0] = model1.treat0
pred.treat1[Sepsis$THERAPY == 1] = model1.treat1
# which is better?
pred.best = (pred.treat1 > pred.treat0)
# is this good?
mean(pred.best == Sepsis$BEST)
## [1] 0.8148936
So it looks like we got pretty good accuracy, 81%. But a random forest is not interpretable, and a doctor cannot directly see why a patient is assigned to a certain treatment.
In the second step, we will construct a single-tree model (CART) to represent the choice of the best treatment. The idea is very simple: use all of our covariates to model the best treatment labels learned in Step 1.
library(rpart)
# fit a decision tree to predict the estimated best treatment label,
# excluding BEST, Health, and THERAPY from the dataset
best.label = ifelse(pred.treat1 > pred.treat0, "Treatment 1", "Treatment 0")
best.label = as.factor(best.label)
rpart.fit <- rpart(best.label ~ . - BEST - Health - THERAPY, data = Sepsis)
# in the coming cells, we will prune the tree
# start by plotting the cross-validation relative error at different tree sizes
plotcp(rpart.fit)
After deciding on a cp value, we can prune the tree and view the fitted model.
# cut the tree
cutted.tree = prune(rpart.fit, cp = 0.04)
library(rpart.plot)
# make a good plot
rpart.plot(cutted.tree)
Hence, the conclusion is that when \(\text{Age} < 52\), we should use treatment 1, while for \(\text{Age} \geq 52\), we should use treatment 0 if the apache score is less than 32, and treatment 1 otherwise.
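As a quick sanity check (not part of the original two-step procedure), we can see how often the pruned tree's recommendation agrees with the doctor-suggested BEST label:
# recommendation from the pruned tree for every patient
tree.pred = predict(cutted.tree, Sepsis, type = "class")
# agreement with the doctor-suggested best treatment
mean((tree.pred == "Treatment 1") == Sepsis$BEST)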
Outcome Weighted Learning (Zhao et al., 2012) is a direct learning approach. Its logic is a bit more involved, since it deals with the probabilities \(P(A = 1 | X)\) and \(P(A = -1 | X)\) when the data do not come from a randomized trial. We often call such data observational data, and making inference on observational data is the main concern of causal inference. Formally, let’s be precise about our target of estimation. Following Qian and Murphy (2011), we first define the value function of a decision rule \(d(x)\):
\[\begin{align} {\cal V}(d) =& E^d(R)\\ =& E_X\big[ E_R[R \mid X, A = d(X)] \big] \end{align}\]
This means that we first (inner expectation \(E_R\)) let everyone take the treatment label suggested by the decision rule \(d\), i.e., \(A = d(X)\), and obtain the expected reward; we then average over the entire population (\(E_X\)). Hence, our goal is to maximize \({\cal V}(d)\) by searching for the best \(d(X)\):
\[ d_\text{opt} = \underset{d}{\arg\max} \,\, {\cal V}(d) \]
However, we do not observe data from the distribution in which everyone follows the suggested treatment \(d(X)\). Instead, our observational study is collected under its own mechanism, or distribution, denoted as \((R, X, A) \sim \mu\), while our target is the value function under the restricted policy that gives \((R, X, A = d(X)) \sim \mu^d\). A tool we can utilize is the Radon-Nikodym theorem, which suggests that
\[\begin{align} {\cal V}(d) =& \int R d\mu^d \\ =& \int R \frac{d\mu^d}{d\mu}d\mu \\ =& \int \frac{R \cdot 1\{A = d(X)\}}{P(A | X)} d\mu \end{align}\]
Now, since the data are observed under \(\mu\), we can use the sample (empirical) estimate of this value function:
\[\widehat{d}_\text{opt} = \underset{d}{\arg\max} \frac{1}{n} \sum_{i=1}^n \frac{R_i \cdot 1\{A_i = d(X_i)\} }{\widehat{P}(A_i | X_i)}.\] We could specify \(d(X_i)\) as a linear function, or as a nonlinear function in a Reproducing Kernel Hilbert Space \({\cal H}\), which would require adding a penalty term:
\[\widehat{d}_\text{opt} = \underset{d}{\arg\max} \frac{1}{n} \sum_{i=1}^n \frac{R_i \cdot 1\{A_i = d(X_i)\} }{\widehat{P}(A_i | X_i)} - \lambda \| d\|_{\cal H}^2.\] Lastly, we can turn this into a minimization problem by replacing the indicator \(1\{A_i = d(X_i)\}\) with \(1\{A_i \neq d(X_i)\}\):
\[\widehat{d}_\text{opt} = \underset{d}{\arg\min} \frac{1}{n} \sum_{i=1}^n \frac{R_i \cdot 1\{A_i \neq d(X_i)\} }{\widehat{P}(A_i | X_i)} + \lambda \| d\|_{\cal H}^2.\] The interesting view of this formulation is that it becomes a weighted classification problem, where

- \(d(X_i)\) is the decision rule,
- \(A_i\) is the class label, and
- \(\frac{R_i}{\widehat{P}(A_i | X_i)}\) is the subject-specific weight.
Hence, our goal is to perform classification to model the labels, but we want to encourage the model to do a better job for those subjects with high rewards. Let’s look at an example, where we assume that \(P(A | X) = 1/2\).
set.seed(1)
n = 800
p = 2
x1 <- runif(n, 0, 1)
x2 <- runif(n, 0, 1)
# randomized binary treatment, coded as -1 / 1
a = rbinom(n, 1, 0.5)*2-1
# the true best treatment depends on which side of a sine curve the point falls
side = sign(x2 - sin(2*pi*x1)/3-0.5)
# the reward is higher when the received treatment matches the best side
R <- rnorm(n, mean = ifelse(side==a, 1.5, 0.5), sd = 0.5)
# the observed data, with class labels
par(mar=rep(1,4))
plot(x1, x2, pch = 19, xaxt = "n", yaxt = "n", xlab = "", ylab = "",
col = ifelse(a==1, "deepskyblue", "darkorange"))
legend("topright", c("Treatment = 1", "Treatment = -1"), pch = 19,
col = c("deepskyblue", "darkorange"), cex = 1.3)
Now let’s add the reward as weights, indicated by the point size. It is clear that on the lower side treatment \(-1\) is dominating, while on the upper side treatment \(1\) is dominating. The true decision line is also shown below.
par(mfrow = c(1, 2))
par(mar=rep(1,4))
plot(x1, x2, pch = 19, xaxt = "n", yaxt = "n", xlab = "", ylab = "",
cex = (R+1.5)/3, col = ifelse(a==1, "deepskyblue", "darkorange"))
legend("topright", c("Treatment = 1", "Treatment = -1"), pch = 19,
col = c("deepskyblue", "darkorange"), cex = 1.3)
plot(x1, x2, pch = 19, xaxt = "n", yaxt = "n", xlab = "", ylab = "",
cex = (R+1.5)/3, col = ifelse(a==1, "deepskyblue", "darkorange"))
legend("topright", c("Treatment = 1", "Treatment = -1"), pch = 19,
col = c("deepskyblue", "darkorange"), cex = 1.3)
x1_grid = sort(x1)
lines(x1_grid, sin(2*pi*x1_grid)/3+0.5, lwd = 3)
Let’s use an R package to solve this problem. We first try a linear kernel.
# you need to install the package from github
# devtools::install_github("cran/DTRlearn")
library(DTRlearn)
## Loading required package: kernlab
##
## Attaching package: 'kernlab'
## The following object is masked from 'package:scales':
##
## alpha
## Loading required package: MASS
## Loading required package: ggplot2
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:kernlab':
##
## alpha
## The following object is masked from 'package:randomForest':
##
## margin
par(mfrow = c(1, 1))
# fit OWL using a linear kernel
owl.fit = wsvm(X = cbind(x1, x2), A = a, wR = R-min(R), kernel = "linear")
# plot results
xgrid1 = seq(0, 1, 0.01)
xgrid2 = seq(0, 1, 0.01)
mesh = expand.grid(xgrid1, xgrid2)
colnames(mesh) = c("x1", "x2")
meshlabel = predict(owl.fit, data.matrix(mesh))
# the decision line
contour(xgrid1, xgrid2, xaxt = "n", yaxt = "n", lwd = 2,
matrix(meshlabel, length(xgrid1), length(xgrid2)), levels = c(0.5))
points(mesh, pch=".", cex=1.2,
col=ifelse(meshlabel==1, "deepskyblue", "darkorange"))
points(x1, x2, pch = 19, cex = (R+0.5)/1.5,
col = ifelse(a==1, "deepskyblue", "darkorange"))
Clearly, this is not ideal. Hence, we can try the RBF kernel:
owl.fit = wsvm(X = cbind(x1, x2), A = a, wR = R-min(R), kernel = "rbf", sigma = 2)
meshlabel = predict(owl.fit, data.matrix(mesh))
# the decision line
contour(xgrid1, xgrid2, xaxt = "n", yaxt = "n", lwd = 2,
matrix(meshlabel, length(xgrid1), length(xgrid2)), levels = c(0.5))
points(mesh, pch=".", cex=1.2,
col=ifelse(meshlabel==1, "deepskyblue", "darkorange"))
points(x1, x2, pch = 19, cex = (R+0.5)/1.5,
col = ifelse(a==1, "deepskyblue", "darkorange"))
We can also use random forests to perform this weighted classification.
# fit OWL using a random forest
library(RLT)
## RLT and Random Forests v4.2.6
## pre-release at github.com/teazrq/RLT
owl.fit = RLT(cbind(x1, x2), as.factor(a), model = "classification",
obs.w = R - min(R) + 0.1, ntrees = 1000,
resample.prob = 0.75, resample.replace = FALSE,
nmin = 45, mtry = 2)
meshlabel = predict(owl.fit, mesh)$Prediction
# the decision line
contour(xgrid1, xgrid2, xaxt = "n", yaxt = "n", lwd = 2,
matrix(meshlabel, length(xgrid1), length(xgrid2)), levels = c(0.5))
points(mesh, pch=".", cex=1.2,
col=ifelse(meshlabel==1, "deepskyblue", "darkorange"))
points(x1, x2, pch = 19, cex = (R+0.5)/1.5,
col = ifelse(a==1, "deepskyblue", "darkorange"))
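Since we know the true decision boundary in this simulation, a rough check (not in the original demonstration) is to compare the fitted rule with the true best treatment at the observed covariates, and to compute the inverse-propensity-weighted estimate of its value using \(P(A | X) = 1/2\):
# predicted treatment for the observed subjects
owl.pred = predict(owl.fit, data.frame(x1 = x1, x2 = x2))$Prediction
owl.pred = as.numeric(as.character(owl.pred))
# agreement with the true best treatment
mean(owl.pred == side)
# inverse propensity weighted estimate of the value of this rule
mean(R * (owl.pred == a) / 0.5)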
Personalized medicine does not end here. There are many other settings to consider, for example, continuous treatment decisions (Chen et al., 2016), outcomes subject to censoring (Cui et al., 2022), and dynamic treatment decisions (Chakraborty and Murphy, 2014). The problem eventually falls into the domain of reinforcement learning when there is a very large number of decisions to make.