We have now seen several examples of estimating the optimal treatment regime. In this section, we explore several issues and topics around this problem, which hopefully provide some guidelines to keep in mind when you work on your own project.
Outcome Weighted Learning (Zhao et al. 2012) is probably one of the most popular direct learning approaches. Formally, let us be precise about our target of estimation. Following Qian and Murphy (2011), define a quantity called the value function of a decision rule \(\pi(x)\). This is also a concept in reinforcement learning, where it is more common to use \(R\) (reward) rather than \(Y\) to denote the outcome; we will follow that notation. The value function is defined as follows:
\[ \begin{aligned} {\cV}(\pi) =& \E^\pi(R)\\ =& \E_X\big[ \E_R[R \mid X, A = \pi(X)] \big] \end{aligned} \]
This means that we first (inner expectation \(\E_R\)) let everyone take the treatment suggested by the decision rule \(\pi\), i.e., \(A = \pi(X)\), and obtain the expected reward, and then average over the entire population (outer expectation \(\E_X\)). Hence, our goal is to maximize \({\cV}(\pi)\) by searching for the best \(\pi(X)\):
\[ \pi_\text{opt} = \underset{\pi}{\arg\max} \,\, {\cV}(\pi) \]
Note that this is the same as our previous definition of the optimal treatment regime: if a rule maximizes the outcome for each individual, it also maximizes the population average. However, a tricky issue is that we never observe data in which everyone follows the suggested treatment \(\pi(X)\). Instead, our observational data are collected under their own mechanism, or distribution, denoted \((R, X, A) \sim \mu\); this is called the behavior policy. Our target, on the other hand, is the value function under the policy that enforces \(A = \pi(X)\), giving \((R, X, A = \pi(X)) \sim \mu^\pi\). A tool we can utilize is the Radon-Nikodym theorem, which suggests that
\[ \begin{aligned} \cV(\pi) =& \int R d\mu^\pi \\ =& \int R \frac{d\mu^\pi}{d\mu}d\mu \\ =& \int \frac{R \cdot \mathbf{1}\{A = \pi(X)\}}{\Pr(A | X)} d\mu \end{aligned} \]
Now, since we observe samples from \(\mu\), we can use the sample (empirical) estimate of this value function:
\[ \widehat{\pi}_\text{opt} = \underset{\pi}{\arg\max} \,\, \frac{1}{n} \sum_{i=1}^n \frac{R_i \cdot \mathbf{1}\{A_i = \pi(X_i)\} }{\widehat{\Pr}(A_i | X_i)}. \]
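To make this estimator concrete, here is a minimal sketch of the inverse-propensity-weighted value estimate on a small toy example; the candidate rule pi_hat and the constant 0.5 propensity below are illustrative assumptions rather than part of the original analysis.
# a minimal sketch of the IPW value estimate for a fixed candidate rule
set.seed(42)
n0 <- 500
x0 <- runif(n0)
a0 <- rbinom(n0, 1, 0.5)*2 - 1                    # behavior policy: randomize with prob 0.5
r0 <- rnorm(n0, mean = ifelse(a0 == sign(x0 - 0.5), 1, 0))
pi_hat <- function(x) sign(x - 0.5)               # an (assumed) candidate decision rule
pscore0 <- rep(0.5, n0)                           # known propensity in a randomized design
# empirical value of pi_hat: average of R * 1{A = pi(X)} / Pr(A | X)
mean(r0 * (a0 == pi_hat(x0)) / pscore0)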
Since \(\mathbf{1}\{A_i = \pi(X_i)\} = 1 - \mathbf{1}\{A_i \neq \pi(X_i)\}\), and the term that does not involve \(\pi\) is a constant, we can simply change the equal sign to an unequal sign and turn this into an equivalent minimization problem:
\[ \widehat{\pi}_\text{opt} = \underset{\pi}{\arg\min} \frac{1}{n} \sum_{i=1}^n \frac{R_i \cdot 1\{A_i \neq \pi(X_i)\} }{\widehat{\Pr}(A_i | X_i)} \]
The interesting view of this formulation is that it becomes a weighted classification problem: the treatment labels \(A_i\) serve as class labels, the covariates \(X_i\) are the features, and \(R_i / \widehat{\Pr}(A_i | X_i)\) are subject-specific weights. Hence, our goal is to fit a weighted classification model, encouraging the model to do a better job on subjects with high rewards (large weights). The following code visually demonstrates how this estimator works.
set.seed(1)
n = 800
p = 2
x1 <- runif(n, 0, 1)
x2 <- runif(n, 0, 1)
a = rbinom(n, 1, 0.5)*2-1
side = sign(x2 - sin(2*pi*x1)/3-0.5)
R <- rnorm(n, mean = ifelse(side==a, 1.5, 0.5), sd = 0.5)
# the observed data, with class labels
par(mar=rep(1,4))
plot(x1, x2, pch = 19, xaxt = "n", yaxt = "n", xlab = "", ylab = "",
col = ifelse(a==1, "deepskyblue", "darkorange"))
legend("topright", c("Treatment = 1", "Treatment = -1"), pch = 19,
col = c("deepskyblue", "darkorange"), cex = 1.3)
Now let's add the reward as weights, indicated by the point size. It is clear that on the lower side, treatment \(-1\) is dominant, and on the upper side, treatment \(1\) is dominant. The true decision boundary is also shown in the right panel.
par(mfrow = c(1, 2))
par(mar=rep(1,4))
plot(x1, x2, pch = 19, xaxt = "n", yaxt = "n", xlab = "", ylab = "",
cex = (R+2)/3, col = ifelse(a==1, "deepskyblue", "darkorange"))
legend("topright", c("Treatment = 1", "Treatment = -1"), pch = 19,
col = c("deepskyblue", "darkorange"), cex = 1.3)
plot(x1, x2, pch = 19, xaxt = "n", yaxt = "n", xlab = "", ylab = "",
cex = (R+2)/3, col = ifelse(a==1, "deepskyblue", "darkorange"))
legend("topright", c("Treatment = 1", "Treatment = -1"), pch = 19,
col = c("deepskyblue", "darkorange"), cex = 1.3)
x1_grid = sort(x1)
lines(x1_grid, sin(2*pi*x1_grid)/3+0.5, lwd = 3)
The outcome weighted learning framework permits a wide range of functional classes for estimating \(\pi\): a simple linear model, a function in a reproducing kernel Hilbert space, a decision tree, a random forest, or a neural network. Be aware, however, that the functional class cannot be too flexible; otherwise it would exactly fit the observed data with \(\widehat\pi(X_i) = A_i\). The following code demonstrates the idea using a simple linear model. It fits reasonably well, but we know a linear rule incurs a large bias for this problem because the true decision boundary, \(x_2 = \sin(2\pi x_1)/3 + 0.5\), is nonlinear.
mydata = data.frame("x1" = x1, "x2" = x2, "R" = R, "a" = a)
# estimate the propensity score
# although we know this is a randomized trial
pscore = glm(as.factor(a) ~ x1 + x2, data = mydata, family = "binomial")$fitted.values
# calculate the subject specific weight
# we add a constant to R so the weights are positive, which does not change the optimization
W = (mydata$R + 2)/pscore
# fit a weighted classification model to estimate the OTR
fit = glm(as.factor(a) ~ x1 + x2, data = mydata, family = "binomial", weights = W)
# the decision rule
table(fit$fitted.values > 0.5, side)
## side
## -1 1
## FALSE 394 90
## TRUE 24 292
mean((fit$fitted.values > 0.5) == (side == 1))
## [1] 0.8575
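As a quick, optional check (not part of the original output), we can plug the fitted linear rule back into the IPW value estimate introduced earlier, reusing mydata, pscore, and fit from the chunk above.
# decision of the fitted linear rule, coded as -1 / 1
pi_linear = ifelse(fit$fitted.values > 0.5, 1, -1)
# Pr(A_i | X_i): pscore is the fitted Pr(A = 1 | X), so flip it when A = -1
p_obs = ifelse(mydata$a == 1, pscore, 1 - pscore)
# estimated value of the linear rule (using the original reward R, no shift)
mean(mydata$R * (mydata$a == pi_linear) / p_obs)
# for comparison, the value of treating everyone with 1 or everyone with -1
mean(mydata$R * (mydata$a == 1) / p_obs)
mean(mydata$R * (mydata$a == -1) / p_obs)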
Since the linear model does not perform well, let's try a much more flexible model. Oftentimes, a Reproducing Kernel Hilbert Space (RKHS) is a common choice to specify the functional class for \(\pi\). Our OWL framework is then updated to the following:
\[ \widehat{\pi}_\text{opt} = \underset{\pi}{\arg\min} \frac{1}{n} \sum_{i=1}^n \frac{R_i \cdot 1\{A_i \neq \pi(X_i)\} }{\widehat{\Pr}(A_i | X_i)} + \lambda \| \pi \|_{\cal H}^2. \]
where the penalty term is the norm (complexity) of the estimated function in the RKHS. By the reproducing property and the Representer Theorem, we can show that the estimated policy takes the form
\[ \widehat{\pi}(x) = \sum_{i=1}^n \alpha_i K(x, x_i) \]
where \(K(\cdot, \cdot)\) is the kernel function, and then the penalty term is simply
\[ \begin{aligned} \| \widehat{\pi} \|_{\cal H}^2 &= \langle \sum_{i=1}^n \alpha_i K(x, x_i), \sum_{j=1}^n \alpha_j K(x, x_j) \rangle_{\cal H} \\ &= \sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j K(x_i, x_j) \\ &= \balpha^T \bK \balpha \end{aligned} \]
where \(\bK\) is the \(n \times n\) kernel matrix with \((i,j)\)-th entry \(K(x_i, x_j)\). For solving the weighted classification problem, the support vector machine (SVM) is a popular choice; for background on SVMs, we refer to this lecture note. This estimation can be carried out using the WeightSVM package by specifying subject weights.
library(WeightSVM)
owl.fit = wsvm(a ~ x1 + x2, data = mydata,
kernel = "radial", gamma = 0.15,
weight = W)
table(owl.fit$fitted > 0, side)
## side
## -1 1
## FALSE 395 63
## TRUE 23 319
mean((owl.fit$fitted > 0) == (side == 1))
## [1] 0.8925
par(mfrow = c(1, 1))
par(mar=rep(1,4))
plot(x1, x2, pch = 19, xaxt = "n", yaxt = "n", xlab = "", ylab = "",
col = ifelse(owl.fit$fitted > 0, "deepskyblue", "darkorange"))
legend("topright", c("Treatment = 1", "Treatment = -1"), pch = 19,
col = c("deepskyblue", "darkorange"), cex = 1.3)
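Before moving on, the following small sketch builds the Gaussian (radial) kernel matrix on \((x_1, x_2)\) by hand and evaluates the quadratic penalty \(\balpha^T \bK \balpha\) from the RKHS formulation above; the coefficient vector alpha below is an arbitrary placeholder, not the one fitted by wsvm.
# build the radial kernel matrix with entries exp(-gamma * ||x_i - x_j||^2)
rbf_kernel <- function(X, gamma = 0.15) {
  D2 <- as.matrix(dist(X))^2          # pairwise squared Euclidean distances
  exp(-gamma * D2)
}
Kmat <- rbf_kernel(cbind(x1, x2))
# an arbitrary coefficient vector, for illustration only
alpha <- rnorm(length(x1), sd = 0.01)
# the RKHS penalty alpha^T K alpha
drop(t(alpha) %*% Kmat %*% alpha)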
Let's use the lalonde data again. Recall that the outcome is re78, and the treatment treat indicates whether the subject received a job training program. For this example, we will use the DTRlearn2 package.
library(Matching)
## Loading required package: MASS
## ##
## ## Matching (Version 4.10-14, Build Date: 2023-09-13)
## ## See https://www.jsekhon.com for additional documentation.
## ## Please cite software as:
## ## Jasjeet S. Sekhon. 2011. ``Multivariate and Propensity Score Matching
## ## Software with Automated Balance Optimization: The Matching package for R.''
## ## Journal of Statistical Software, 42(7): 1-52.
## ##
data(lalonde)
head(lalonde)
library(DTRlearn2)
## Loading required package: kernlab
## Loading required package: Matrix
## Loading required package: foreach
## Loading required package: glmnet
## Loaded glmnet 4.1-8
X = data.frame("age" = scale(lalonde$age),
"educ" = scale(lalonde$educ),
"black" = lalonde$black,
"hisp" = lalonde$hisp,
"married" = lalonde$married,
"nodegr" = lalonde$nodegr,
"re74" = lalonde$re74/max(lalonde$re74),
"re75" = lalonde$re75/max(lalonde$re75))
A = 2*(as.numeric(lalonde$treat) - 0.5)
R = lalonde$re78
# fit the OWL model
OWL.fit = owl(H = data.matrix(X), A = A, R = R,
K = 1, # our problem has only one stage of treatment, so K = 1
kernel = "rbf", # radial basis function kernel
sigma = 0.5) # bandwidth for each covariate
## Warning in select[which(select == TRUE)] <- (results[[j]]$treatment == AA[[j]][select]): number of items to replace is not a multiple of replacement length
# the estimated value function
OWL.fit$valuefun
## [1] 6957.606
This estimated value function is higher than the average outcome of the control group (4554.8) or of the treated group (6349.15), indicating a benefit of individualized treatment over the one-size-fits-all approach.
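For reference, the two one-size-fits-all benchmarks quoted above can be computed directly from the loaded data (a small check, not part of the original output):
# average outcome if everyone were untreated vs. everyone treated
mean(lalonde$re78[lalonde$treat == 0])
mean(lalonde$re78[lalonde$treat == 1])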
In the previous sections, we mainly focused on binary treatments. In many cases, however, the treatment is continuous; the most common example is the dosage of a drug. We then need to modify the methods we have learned to accommodate a continuous treatment. There are several simple options. We can discretize the treatment into several levels and apply the previous methods, but this discards information about the dose. Alternatively, we can fit a linear regression with interaction terms between the treatment and the covariates (possibly including second-order terms of \(A\)), and then choose the best treatment either by solving for the optimum directly or by a grid search; a quick sketch of this idea is given below. However, linear models can easily be biased, so dedicated methods are needed. A natural idea is to utilize kernel smoothing to modify the outcome weighted learning approach, which leads to the method proposed in Chen, Zeng, and Kosorok (2016).
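Here is a minimal sketch of the regression-plus-grid-search idea mentioned above, on simulated dosage data; the data-generating model, the quadratic-in-\(A\) working model, and the dose grid are all illustrative assumptions.
# simulate a continuous dose A on [0, 2] with a subject-specific optimal dose
set.seed(2)
n_c = 500
dat = data.frame(x1 = runif(n_c), x2 = runif(n_c), A = runif(n_c, 0, 2))
opt_dose = with(dat, 1 + x1 - x2)                 # true optimal dose (unknown in practice)
dat$R = rnorm(n_c, mean = 2 - (dat$A - opt_dose)^2, sd = 0.5)
# outcome model with dose-covariate interactions and a quadratic dose term
dose.fit = lm(R ~ (x1 + x2) * A + I(A^2), data = dat)
# grid search: for each subject, pick the dose with the highest predicted outcome
dose_grid = seq(0, 2, by = 0.05)
best_dose = sapply(seq_len(n_c), function(i) {
  nd = data.frame(x1 = dat$x1[i], x2 = dat$x2[i], A = dose_grid)
  dose_grid[which.max(predict(dose.fit, nd))]
})
cor(best_dose, opt_dose)                          # how well the optimal dose is recovered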
Let’s recall that the outcome weighted learning approach for binary/multi-categorical treatment is to solve the following optimization problem:
\[ \widehat{\pi}_\text{opt} = \underset{\pi}{\arg\max} \frac{1}{n} \sum_{i=1}^n \frac{R_i \cdot 1\{A_i = \pi(X_i)\} }{\widehat{\Pr}(A_i | X_i)} \]
When the treatment is continuous, we need to modify the indicator function, since the event \(A_i = \pi(X_i)\) has probability zero. It is helpful to revisit our definition of the value function and the Radon-Nikodym argument we used previously to convert from the observed behavior policy distribution to the target policy distribution. A target policy that sets \(A = \pi(X)\) exactly is difficult to work with, since it is degenerate due to the continuous nature of \(A\). Instead, we can let \(A\) follow a distribution with \(\pi(X)\) as its mean. This is connected to the reinforcement learning literature, where such a rule is called a stochastic target policy. For example, assume that \(A\) follows a Uniform distribution on \([\pi(X) - h, \pi(X) + h]\) for some bandwidth \(h\). The density of \(A\), given that the mean is \(\pi(X)\), is
\[ \pr_\pi(A) = \frac{1}{2h} \cdot 1\left\{A \in [\pi(X) - h, \pi(X) + h]\right\}. \]
Then if we revisit the definition of the value function, we would have
\[ \begin{aligned} \cV(\pi) =& \int R d\mu^\pi \\ =& \int R \frac{d\mu^\pi}{d\mu}d\mu \\ =& \int \frac{R \cdot \pr_\pi ( A ) }{\pr(A | X)} d\mu \\ =& \int \frac{R \cdot 1\{A \in [\pi(X) - h, \pi(X) + h]\} }{ 2h \cdot \pr(A | X)} d\mu \\ \end{aligned} \]
This leads to the sample version of the estimator, since we observe samples from \(\mu\):
\[ \widehat{\pi}_\text{opt} = \underset{\pi}{\arg\max} \frac{1}{n} \sum_{i=1}^n \frac{R_i \cdot \mathbf{1}\left\{ A_i \in [\pi(X_i) - h, \pi(X_i) + h] \right\} }{ 2 h \widehat{\pr}(A_i | X_i)}. \]
This permits a variety of choices for the kernel function. Instead of using a Uniform kernel density, we can in general write
\[ \widehat{\pi}_\text{opt} = \underset{\pi}{\arg\max} \frac{1}{n} \sum_{i=1}^n \frac{R_i \cdot K_h( A_i, \pi(X_i)) }{ \widehat{\pr}(A_i | X_i)}. \]
where \(K_h(x, z) = K\big( (x - z)/h \big)/h\) is a kernel function. In this case, we assume that \(A\) follows the distribution associated with \(K\), centered at \(\pi(X_i)\). In Chen, Zeng, and Kosorok (2016), however, the Uniform kernel is used because the empirical loss can then be translated into convex problems via the DC (difference of convex functions) algorithm (Tao and An 1998), which makes the computation easier.
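To make the kernel-smoothed value estimate concrete, the following sketch reuses the simulated dosage data from the earlier sketch. There the behavior policy draws \(A\) uniformly on \([0, 2]\), so the behavior density \(\pr(A \mid X) = 1/2\) is known by design; the Gaussian kernel and the bandwidth \(h\) are illustrative choices (Chen, Zeng, and Kosorok (2016) use the Uniform kernel instead).
# kernel-smoothed value estimate: average of R * K_h(A, pi(X)) / pr(A | X)
h = 0.2
K_h = function(a, z) dnorm((a - z) / h) / h       # Gaussian kernel
p_behavior = 1/2                                  # density of A ~ Unif(0, 2)
value_kernel = function(pi_x) mean(dat$R * K_h(dat$A, pi_x) / p_behavior)
value_kernel(best_dose)                           # the grid-search dose rule from above
value_kernel(rep(1, n_c))                         # a one-size-fits-all dose of 1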
Many of the previously mentioned methods utilize a Reproducing Kernel Hilbert Space (RKHS) to estimate the optimal treatment regime. However, the RKHS approach can be computationally expensive when the number of observations is large, may perform poorly when the treatment rule is non-continuous (a mis-specified functional space), and may suffer from the curse of dimensionality. In addition, the choice of the kernel function can be difficult. Random forests provide unique advantages here, since they naturally handle high-dimensional data and adapt to the sparsity of the model. Furthermore, a random forest decision rule can handle nonlinear and non-continuous treatment rules well.
Using random forests to solve for the optimal treatment regime under the outcome weighted learning framework is relatively straightforward. We can use a random forest to estimate the propensity score (using the out-of-bag predictions) and then fit each tree with the subject-weighted classification objective. The following code demonstrates this idea using the RLT package from GitHub. Be careful that weighted classification can be very sensitive to tuning parameters; the results below are after some extensive tuning.
# install.packages("devtools")
# devtools::install_github("teazrq/RLT")
# this package needs to be compiled if you have trouble installing it, please see
# https://teazrq.github.io/random-forests-tutorial/rlab/basics/packages.html
library(RLT)
## RLT and Random Forests v4.2.6
## pre-release at github.com/teazrq/RLT
set.seed(1)
n = 800
p = 5
x = matrix(runif(n*p), n, p)
ps = exp(x[,3]-0.5)/(1+exp(x[,3]-0.5))
a = rbinom(n, 1, ps)*2-1
side = sign(x[,2] - sin(2*pi*x[,1])/3-0.5)
R <- rnorm(n, mean = ifelse(side==a, 2, 0.5) + 3*x[, 4] + 3*x[, 5], sd = 0.5)
# fit the propensity score
# the $Prob is the OOB estimation of the propensity score
pscore.fit <- RLT(x = x[, 1:p], y = as.factor(a),
ntrees = 5000, mtry = 2, nmin = 2*sqrt(n))
# We can check how well it fits the propensity score
par(mar=rep(2, 4))
plot(ps, pscore.fit$Prob[, 2],
xlab = "True Propensity Score",
ylab = "Estimated Propensity Score",
xlim = c(0.2, 0.8), ylim = c(0.2, 0.8))
abline(0, 1, col = "red", lwd = 2)
# calculate subject weight by forcing them to be positive
W = (R - min(R))/pscore.fit$Prob[, 2]
# Fit random forest with subject weight
owl.fit <- RLT(x = x[, 1:p], y = as.factor(a), obs.w = W,
ntrees = 1000, mtry = 4, nmin = 2*sqrt(n))
oobrule = owl.fit$Prob[, 2] > 0.5
table(oobrule, side)
## side
## oobrule -1 1
## FALSE 311 118
## TRUE 107 264
mean(oobrule == (side == 1))
## [1] 0.71875
# plot the estimated treatment regime
par(mar=rep(2, 4))
plot(x[,1], x[,2], pch = 19, xaxt = "n", yaxt = "n", xlab = "", ylab = "",
col = ifelse(oobrule, "deepskyblue", "darkorange"))
legend("topright", c("Treatment = 1", "Treatment = -1"), pch = 19,
col = c("deepskyblue", "darkorange"), cex = 1.3)
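As an optional check (not part of the original output), we can again plug the estimated rule into the IPW value formula, this time with the out-of-bag propensity scores estimated above; the two one-size-fits-all comparisons are illustrative.
# Pr(A_i | X_i) under the behavior policy, from the OOB propensity model
p_obs = ifelse(a == 1, pscore.fit$Prob[, 2], 1 - pscore.fit$Prob[, 2])
rfrule = ifelse(oobrule, 1, -1)
mean(R * (a == rfrule) / p_obs)                   # estimated value of the forest rule
mean(R * (a == 1) / p_obs)                        # treat everyone with 1
mean(R * (a == -1) / p_obs)                       # treat everyone with -1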
One potential issue with OWL is that it may be quite sensitive to the scale or location shift of the outcome weight \(R\). Subjects with much larger weights (for example, due to large covariate main effects on the outcome) can dominate the learning process. To mitigate this issue, notice that subtracting any function of \(X\) from \(R\) does not change the maximizer of the problem:
\[ \begin{aligned} & \quad \,\, \E \left[ \frac{ \left[ R - g(X) \right] \cdot 1\{A = \pi(X)\} }{\Pr(A | X)} \right] \\ &= \E \left[ \frac{ R \cdot 1\{A = \pi(X)\} }{\Pr(A | X)} \right] - \E \left[ \frac{ g(X) \cdot 1\{A = \pi(X)\} }{\Pr(A | X)} \right] \\ &= \cV(\pi) - \E_X\left[ \E_A \left[ \frac{ g(X) \cdot 1\{A = \pi(X)\} }{\Pr(A | X)} \Biggm| X \right] \right] \\ &= \cV(\pi) - \E_X\left[ g(X) \E_A \left[ \frac{ 1\{A = \pi(X)\} }{\Pr(A | X)} \Biggm| X \right] \right] \\ &= \cV(\pi) - \E_X\left[ g(X) \E_A \left[ \frac{ 1\{\pi(X) = 1\}1\{A = 1\} + 1\{\pi(X) = 0\}1\{A = 0\}}{\Pr(A | X)} \Biggm| X \right] \right] \\ &= \cV(\pi) - \E_X\left[ g(X) \E_A \left[ 1\{\pi(X) = 1\} + 1\{\pi(X) = 0\} \Bigm| X \right] \right] \\ &= \cV(\pi) - \E_X\left[ g(X) \right]. \end{aligned} \]
Hence, the subtracted term \(\E_X[g(X)]\) does not depend on the choice of the decision rule \(\pi\), so the maximizer is unchanged. Interestingly, by choosing \(g(X)\) carefully, we can reduce the variance of our estimator. A convenient choice, used in Zhou et al. (2017), is
\[ g(X) = \E\left[ \frac{R}{2 \Pr(A | X)} | X \right] = \frac{\E[R | X, A = 1] + \E[R | X, A = 0]}{2}. \]
while in Zhu et al. (2017),
\[ g(X) = \E[ R | X] = \E[ R | X, A = 1] \Pr(A = 1 | X) + \E[ R | X, A = 0] \Pr(A = 0 | X). \]
was used. In both cases, the variance of the estimator is reduced, which is especially helpful for random forest splitting. The following is an example on the same dataset.
# Fit an outcome model for E[R | X]
r.fit <- RLT(x = x[, 1:p], y = R,
ntrees = 1000, mtry = 3, nmin = sqrt(n))
# compute residuals by removing the estimated mean outcome
residual = R - r.fit$Prediction
# compare the variance
var(R)
## [1] 2.430013
var(residual)
## [,1]
## [1,] 0.9138632
# define the new weight
W = (residual - min(residual))/pscore.fit$Prob[, 2]
# Fit random forest with subject weight
owl.fit <- RLT(x = x[, 1:p], y = as.factor(a), obs.w = W,
ntrees = 1000, mtry = 5, nmin = 3*sqrt(n))
oobrule = owl.fit$Prob[, 2] > 0.5
table(oobrule, side)
## side
## oobrule -1 1
## FALSE 353 70
## TRUE 65 312
mean(oobrule == (side == 1))
## [1] 0.83125
# plot the estimated treatment regime
par(mar=rep(2, 4))
plot(x[,1], x[,2], pch = 19, xaxt = "n", yaxt = "n", xlab = "", ylab = "",
col = ifelse(oobrule, "deepskyblue", "darkorange"))
legend("topright", c("Treatment = 1", "Treatment = -1"), pch = 19,
col = c("deepskyblue", "darkorange"), cex = 1.3)
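For completeness, here is a hedged sketch of the alternative augmentation of Zhou et al. (2017), \(g(X) = \{\E[R \mid X, A = 1] + \E[R \mid X, A = -1]\}/2\) (with the two arms coded as \(\pm 1\) here). The arm-specific regressions below are simple linear models used only for illustration; any regression method could be substituted.
# fit an outcome model within each treatment arm
xdf = data.frame(x[, 1:p])
fit_pos = lm(R ~ ., data = cbind(xdf, R = R)[a == 1, ])
fit_neg = lm(R ~ ., data = cbind(xdf, R = R)[a == -1, ])
# g(X) = average of the two arm-specific predictions
gX = (predict(fit_pos, xdf) + predict(fit_neg, xdf)) / 2
# new residual-type weights, shifted to be positive
res2 = R - gX
W2 = (res2 - min(res2)) / pscore.fit$Prob[, 2]
# W2 could then be passed to RLT via obs.w, exactly as in the chunk above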
In some cases, the observed outcome can be negative. However, our definition of the optimal treatment regime is invariant to location shifts of \(R\): we can easily verify that \(\E\left[ \frac{\alpha \cdot 1\{A \neq \pi(X)\}}{\Pr(A | X)}\right] = \alpha\) for any constant \(\alpha\). Hence, we can simply add a constant to all \(R_i\) to make them positive.
Note that in Chen, Zeng, and Kosorok (2016), the authors start from a different perspective: they replace the indicator function with one that allows a small margin of error while still viewing the problem as a deterministic policy with \(A = \pi(X)\); the resulting computational framework, however, aligns with the stochastic policy view.