In the previous lecture, we focused on causal inference and the estimation of optimal treatment rules in a single-stage setting, where methods such as inverse probability weighting, doubly robust estimation, and outcome weighted learning were used to identify the best static treatment decision. In this lecture, we extend these ideas to sequential decision making, where treatment decisions are made repeatedly over time in response to evolving patient information. This falls into the framework of dynamic treatment regimes (DTRs, Laber et al. (2014)), a central topic in precision medicine (Kosorok and Laber 2019). We will introduce the formulation of multi-stage decision making and present several approaches for estimating such dynamic decision rules, including the principle of backward induction and the Q-learning and A-learning algorithms. In the next lecture, we will explore some theoretical properties of these methods.
We consider a longitudinal study where each individual is observed over \(T\) decision stages. At each stage \(t = 1, \dots, T\), we collect a state (covariate) vector \(S_t \in \cS_t\) and assign a treatment \(A_t \in \cA_t\).
Oftentimes, for simplicity, we assume that \(\cS_t = \cS\) and \(\cA_t = \cA\) for all \(t\), meaning that the state and action spaces are the same across time. After running this study, we observe a terminal outcome \(Y\) at the end of stage \(T\). In some applications, we may observe intermediate rewards \(R_t\) at each stage \(t\), which can be used to define \(Y = \sum_{t=1}^T R_t\). However, in this traditional DTR setting, we only assume that the final outcome \(Y\) is observed at the end of stage \(T\). To record the history of states and actions along a trajectory, we define
\[ \bar{S}_t = (S_1, \dots, S_t), \quad \text{and} \,\, \bar{A}_t = (A_1, \dots, A_t) \]
The entire data for one subject is therefore
\[ \mathcal{O} = (\bar{S}_T, \bar{A}_T, Y) \]
and we observe \(n\) independent trajectories. A dynamic treatment regime (DTR) is a sequence of decision rules
\[ \bpi = (\pi_1, \pi_2, \dots, \pi_T), \]
where each \(\pi_t : \cH_t \to \cA_t\) is a function that assigns treatment \(A_t\) based on the current history \(H_t = (\bar{S}_t, \bar{A}_{t-1}) \in \cH_t\). The goal of a DTR is to optimize the expected terminal outcome, quantified by the value
\[ V(\bpi) = \E\big[ Y^{(\bpi)} \big], \]
where \(Y^{(\bpi)}\) denotes the potential outcome under the regime \(\bpi\); when intermediate rewards are available, \(Y = \sum_{t=1}^{T} R_t\) is the total reward. It should be noted that in practice we do not need to observe \(R_t\) at each stage, as long as we observe the final outcome \(Y\) at the end of stage \(T\). Our goal is to find an optimal dynamic treatment regime
\[ \bpi^* = \arg\max_{\bpi} V(\bpi). \]
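Before turning to estimation, it may help to see these objects concretely: a regime \(\bpi\) is just a sequence of functions of the accumulating history. The minimal R sketch below (with hypothetical states, rule forms, and a binary treatment coded as \(\pm 1\)) builds a two-stage regime and applies it to one subject's trajectory.
# A hypothetical two-stage regime: each element is a decision rule pi_t(H_t)
# mapping the history to a treatment in {-1, 1}; forms chosen only for illustration
pi_example <- list(
  pi1 = function(h) if (h$S1 > 0) 1 else -1,
  pi2 = function(h) if (h$S2 - 0.5 * h$S1 > 0) 1 else -1
)
# Apply the regime to one subject's evolving history
h1 <- list(S1 = 0.7)
a1 <- pi_example$pi1(h1)                  # stage-1 decision based on H_1 = S_1
h2 <- c(h1, list(A1 = a1, S2 = -0.2))     # history grows to H_2 = (S_1, A_1, S_2)
a2 <- pi_example$pi2(h2)                  # stage-2 decision based on H_2
c(A1 = a1, A2 = a2)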
To illustrate the above notation, we consider the two-stage clinical trial dataset bmiData from the DynTxRegime R package. This is an artificial dataset, but it mimics a study of the effect of meal replacement on adolescent obesity. At the baseline stage, gender, race, parentBMI, and baselineBMI were collected; at the second stage, month4BMI was collected. The decisions A1 and A2 indicate whether the participant received the meal replacement (MR) or control diet (CD) at each stage. The outcome month12BMI is collected at the end of stage 2.
The data flow can be visualized as
library(DynTxRegime)
## Loading required package: modelObj
data("bmiData")
# do a spaghetti plot colored by unique combinations of A1 and A2
# use red, blue, green and darkorange for different combinations
library(ggplot2)
bmiData$ID <- 1:nrow(bmiData)
bmiData_long <- reshape2::melt(bmiData, id.vars = c("ID", "A1", "A2"),
                               measure.vars = c("baselineBMI", "month4BMI", "month12BMI"),
                               variable.name = "Time", value.name = "BMI")
ggplot(bmiData_long, aes(x = Time, y = BMI, group = ID, color = interaction(A1, A2))) +
  geom_line(alpha = 0.5) +
  scale_color_manual(values = c("MR.MR" = "red", "MR.CD" = "blue", "CD.MR" = "green", "CD.CD" = "darkorange")) +
  labs(title = "BMI Trajectories by Treatment Combinations",
       x = "Time Point", y = "BMI", color = "Treatment (A1.A2)") +
  theme_minimal()
Similar to our previous causal inference setting, we require the consistency, sequential ignorability (no unmeasured confounding at each stage), and positivity assumptions to ensure identifiability of the optimal treatment regime. These are essentially generalizations of the standard causal assumptions to the longitudinal setting.
We can assume a data generating model similar to the single-stage setting, to calculate the expected outcome under a given (potentially stochastic) policy \(\pi\):
\[ \begin{aligned} p^{(\pi)}(s_1, a_1, \dots, s_T, a_T, y) &= \underbrace{p(s_1)}_{\text{initial state}} \;\underbrace{\pi_1(a_1 \mid s_1)}_{\text{policy at } t=1} \\ &\quad \times \prod_{t=2}^{T} \Big\{ \underbrace{p(s_t \mid \bar{s}_{t-1}, \bar{a}_{t-1})}_{\text{state transition}} \;\underbrace{\pi_t(a_t \mid \bar{s}_t, \bar{a}_{t-1})}_{\text{policy at } t} \Big\} \\ &\quad \times \underbrace{p(y \mid \bar{s}_T, \bar{a}_T)}_{\text{terminal outcome model}} . \end{aligned} \]
Under this model, the expected outcome under policy \(\bpi\) can be expressed as
\[ \begin{aligned} \E\!\left[ Y^{(\bpi)} \right] &= \int \E\!\left[ Y \mid \bar{S}_T = \bar{s}_T, \bar{A}_T = \bar{a}_T \right] \; p^{(\bpi)}(\bar{s}_T, \bar{a}_T) \; d\bar{s}_T\, d\bar{a}_T \\[6pt] &= \int \E\!\left[ Y \mid \bar{s}_T, \bar{a}_T \right] \; p(s_1) \;\pi_1(a_1 \mid s_1) \;\prod_{t=2}^{T} \Big\{ p(s_t \mid \bar{s}_{t-1}, \bar{a}_{t-1}) \;\pi_t(a_t \mid \bar{s}_t, \bar{a}_{t-1}) \Big\} \; d\bar{s}_T\, d\bar{a}_T . \end{aligned} \tag{1} \]
This is typically referred to as the G-formula (with “G” standing for “generalized”) in the causal inference literature (Murphy 2003).
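As a quick illustration of how the G-formula can be evaluated in practice: once the initial-state, transition, and outcome models are specified (or estimated), we can simulate trajectories forward under the target policy and average the outcome model. The generative model below is entirely hypothetical and serves only to show the mechanics for \(T = 2\).
# Monte Carlo evaluation of the G-formula under a hypothetical two-stage model
set.seed(1)
n_mc <- 1e5
s1 <- rnorm(n_mc, mean = 35, sd = 4)            # S1 ~ p(s1)
a1 <- as.numeric(s1 > 35)                       # deterministic pi_1(s1)
s2 <- 0.9 * s1 - 1.5 * a1 + rnorm(n_mc)         # S2 ~ p(s2 | s1, a1)
a2 <- as.numeric(s2 > 33)                       # deterministic pi_2(s1, a1, s2)
y_mean <- 0.8 * s2 - 2 * a2 - 0.5 * a1 * a2     # E[Y | s1, a1, s2, a2]
mean(y_mean)                                    # estimate of E[Y^(pi)]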
The G-formula provides a way to identify the value of any regime \(\bpi\) from the observed data. However, to find the optimal regime \(\bpi^*\), we need to optimize the expected outcome over all possible sequences of decision rules. Directly maximizing the G-formula is almost infeasible when \(T\) or the state space is large. Instead, we can take advantage of the sequential structure of the problem via Q-learning, which is based on the principles of dynamic programming and backward induction. The key insight is that the optimal treatment rule at each stage can be determined recursively, working backward from the final stage. Starting from the G-formula, we can decompose the expected outcome stage by stage.
Note that at the last stage, we have the following components in the density:
\[ \E[Y \mid \bar{s}_T, \bar{a}_T] \, \pi_T(a_T \mid \bar{s}_T, \bar{a}_{T-1}). \]
By slightly changing the notation, we can separate the last-stage information:
\[ h_T = (\bar{s}_T, \bar{a}_{T-1}), \]
and rewrite the above as
\[ \E[Y \mid h_T, a_T] \, \pi_T(a_T \mid h_T). \]
This is analogous to the single-stage causal inference problem, where \(h_T\) plays the role of covariates and \(a_T\) the treatment. This motivates defining an action-value function (or Q-function) at stage \(T\) as
\[ Q_T(h_T, a_T) = \E[Y \mid H_T = h_T, A_T = a_T]. \]
The policy-induced joint density can be factorized as
\[ p^{(\bpi)}(\bar{s}_T, \bar{a}_T) = \Big[ p^{(\bpi)}(\bar{s}_{T-1}, \bar{a}_{T-1}) \, p(s_T \mid h_{T-1}, a_{T-1}) \Big] \pi_T(a_T \mid h_T), \] so that \[ p^{(\bpi)}(h_T) = p^{(\bpi)}(\bar{s}_{T-1}, \bar{a}_{T-1}) \, p(s_T \mid h_{T-1}, a_{T-1}). \]
Substitute this into the G-formula:
\[ \begin{aligned} \E[Y^{(\bpi)}] &= \int \E[Y \mid h_T, a_T] \, \pi_T(a_T \mid h_T) \, p^{(\bpi)}(h_T) \, da_T \, dh_T \\[4pt] &= \E_{H_T}\!\Big[ \E_{A_T\sim\pi_T(\cdot\mid H_T)} \big[ Q_T(H_T, A_T) \big] \Big] \\[4pt] &= \E_{H_T}\!\left[ V_T^{\pi_T}(H_T) \right], \end{aligned} \] where \(V_T^{\pi_T}(h_T) = \E_{A_T\sim\pi_T(\cdot\mid h_T)} [ Q_T(h_T, A_T) ]\) is also called the value function at stage \(T\) under the final-stage policy \(\pi_T\).
Now, we can work backward to stage \(T-1\). The components in the G-formula involving stage \(T-1\) are
\[ V_T^{\pi}(h_T) \, p(s_T \mid h_{T-1}, a_{T-1}) \, \pi_{T-1}(a_{T-1} \mid h_{T-1}). \]
With the understanding that
\[ h_{T-1} = (\bar{s}_{T-1}, \bar{a}_{T-2}), \]
we can define the stage-\((T-1)\) Q-function as the conditional expectation of the continuation value \(V_T^{\pi}(H_T)\), given the history up to stage \(T-1\) and the current action \(A_{T-1}\):
\[ Q_{T-1}(h_{T-1}, a_{T-1}) = \E\!\left[ V_T^{\pi}(H_T) \mid H_{T-1} = h_{T-1}, A_{T-1} = a_{T-1} \right]. \]
Here, \(V_T^{\pi}(H_T)\) itself depends on the future history \(H_T = (H_{T-1}, A_{T-1}, S_T)\), but we do not model this dependence explicitly. Instead, the conditional expectation integrates over the distribution of \(S_T\) induced by \(p(s_T \mid h_{T-1}, a_{T-1})\). That is,
\[ Q_{T-1}(h_{T-1}, a_{T-1}) = \int V_T^{\pi}(h_T)\, p(s_T \mid h_{T-1}, a_{T-1})\, ds_T. \]
In practice, when estimating \(Q_{T-1}\), we do not need an explicit model for the transition distribution \(p(s_T \mid h_{T-1}, a_{T-1})\). We only need to treat the stage-\(T\) value \(V_T^{\pi}(H_T)\) as a pseudo-outcome observed for each individual and regress it on \((H_{T-1}, A_{T-1})\). This pseudo-outcome corresponds to the expected future outcome after taking action \(A_{T-1}\) given history \(H_{T-1}\). When optimizing the policy, this pseudo-outcome is taken to be the estimated optimal value \(\max_{a_T} \widehat Q_T(H_T, a_T)\), so that the optimal rule can be learned recursively at stage \(T-1\).
Substituting this definition back into the G-formula gives
\[ \E[Y^{(\bpi)}] = \E_{H_{T-1}}\!\Big[ \E_{A_{T-1}\sim\pi_{T-1}(\cdot\mid H_{T-1})} \big[ Q_{T-1}(H_{T-1}, A_{T-1}) \big] \Big] = \E_{H_{T-1}}\!\left[ V_{T-1}^{\pi}(H_{T-1}) \right], \] where \[ V_{T-1}^{\pi}(h_{T-1}) = \E_{A_{T-1}\sim\pi_{T-1}(\cdot\mid h_{T-1})} \big[ Q_{T-1}(h_{T-1}, A_{T-1}) \big]. \]
This process continues recursively for \(t = T-2, T-3, \dots, 1\), leading to the general recursion
\[ Q_t(h_t, a_t) = \E\!\left[ V_{t+1}^{\pi}(H_{t+1}) \mid H_t = h_t, A_t = a_t \right], \qquad V_t^{\pi}(h_t) = \E_{A_t\sim\pi_t(\cdot\mid h_t)} \big[ Q_t(h_t, A_t) \big]. \]
In the case of an optimal policy \(\bpi^*\), the expectation over \(\pi_t\) is replaced by maximization, leading to a Bellman-type optimality recursion:
\[ Q_t^*(h_t, a_t) = \E\!\left[ V_{t+1}^{*}(H_{t+1}) \mid H_t = h_t, A_t = a_t \right], \qquad V_t^{*}(h_t) = \max_{a_t \in \cA_t} Q_t^{*}(h_t, a_t), \]
with \(Q_T^*(h_T, a_T) = \E[Y \mid H_T = h_T, A_T = a_T]\) at the final stage, so that the optimal rule at each stage is \(\pi_t^*(h_t) = \arg\max_{a_t} Q_t^*(h_t, a_t)\).
We can implement the Q-learning algorithm on the bmiData dataset to estimate the optimal dynamic treatment regime for the two-stage meal replacement study. We use linear regression models to estimate the Q-functions at each stage. Note that this is a minimization problem rather than a maximization problem, since we want to minimize BMI.
library(DynTxRegime)
data("bmiData")
# Stage 2 Q-function estimation
bmiData$A2_num <- ifelse(bmiData$A2 == "MR", 1, 0)
stage2_Q <- lm(month12BMI ~ month4BMI*A2_num, data = bmiData)
# Predict Q2 values for both treatment options
bmiData$Q2_MR <- predict(stage2_Q, newdata = transform(bmiData, A2_num = 1))
bmiData$Q2_CD <- predict(stage2_Q, newdata = transform(bmiData, A2_num = 0))
# Optimal treatment at stage 2
bmiData$A2_opt <- ifelse(bmiData$Q2_MR < bmiData$Q2_CD, "MR", "CD")
# value of optimal treatment at stage 2
bmiData$V2_opt <- pmin(bmiData$Q2_MR, bmiData$Q2_CD)
# Predict Q1 values for stage 1 using V2_opt as outcome
bmiData$A1_num <- ifelse(bmiData$A1 == "MR", 1, 0)
stage1_Q <- lm(V2_opt ~ (baselineBMI + gender + race + parentBMI)*A1_num, data = bmiData)
# Predict Q1 values for both treatment options
bmiData$Q1_MR <- predict(stage1_Q, newdata = transform(bmiData, A1_num = 1))
bmiData$Q1_CD <- predict(stage1_Q, newdata = transform(bmiData, A1_num = 0))
# Optimal treatment at stage 1
bmiData$A1_opt <- ifelse(bmiData$Q1_MR < bmiData$Q1_CD, "MR", "CD")
# value of optimal treatment at stage 1
bmiData$V1_opt <- pmin(bmiData$Q1_MR, bmiData$Q1_CD)
# compare the density plot of observed and optimal treatments
library(ggplot2)
ggplot() +
  geom_density(data = bmiData, aes(x = month12BMI, fill = "Observed"), alpha = 0.4) +
  geom_density(data = bmiData, aes(x = V1_opt, fill = "Optimal"), alpha = 0.4) +
  scale_fill_manual(values = c("Observed" = "deepskyblue", "Optimal" = "darkorange")) +
  labs(title = "Density of Observed vs Optimal BMI at Month 12",
       x = "Month 12 BMI", y = "Density", fill = "Legend") +
  xlim(25, 50) +
  theme_minimal()
This can also be implemented using the qLearn function in the package. We will consider a slightly different setting, in which all the immediate rewards \(R_t\) are zero and we simply observe the final reward at the last stage.
\[ \text{Reward} = -100 \times \frac{\text{month12BMI}-\text{baselineBMI}}{\text{baselineBMI}}. \]
We need to specify models at the second stage using functions defined in the package.
moMain specifies the main-effect covariates (those without an interaction with the treatment).
moCont specifies the covariates that interact with the treatment at the current stage.
We then use the qLearn function to fit the model at the final stage. qLearn performs a single step of the Q-learning algorithm. moMain and moCont together determine the covariates of the model, response refers to the final reward, and txName refers to the name of the treatment variable assigned at this stage. We can see that the obtained result is the same as from our own model.
# define the aggregated reward over two stages
Y <- -100*(bmiData[,6L] - bmiData[,4L])/bmiData[,4L]
# Specify models at the second stage
moMain <- buildModelObj(model = ~ parentBMI + month4BMI,
                        solver.method = 'lm') # specify main effect
moCont <- buildModelObj(model = ~ race + parentBMI + month4BMI,
                        solver.method = 'lm') # specify interaction with treatment
# Second-Stage Analysis
fitSS <- qLearn(moMain = moMain, moCont = moCont,
                data = bmiData,
                response = Y,
                txName = 'A2')
## First step of the Q-Learning Algorithm.
##
## Outcome regression.
## Combined outcome regression model: ~ parentBMI+month4BMI + A2 + A2:(race+parentBMI+month4BMI) .
## Regression analysis for Combined:
##
## Call:
## lm(formula = YinternalY ~ parentBMI + month4BMI + A2 + A2:race +
## parentBMI:A2 + month4BMI:A2, data = data)
##
## Coefficients:
## (Intercept) parentBMI month4BMI A2MR A2CD:race A2MR:race parentBMI:A2MR month4BMI:A2MR
## 48.04514 -0.35715 -0.83827 -14.54904 -0.25018 1.64329 0.39890 0.02297
##
##
## Recommended Treatments:
## CD MR
## 98 112
##
## Estimated value: 7.54403
show(fitSS)
## Q-Learning: step 1
## Outcome Regression Analysis
## Combined
##
## Call:
## lm(formula = YinternalY ~ parentBMI + month4BMI + A2 + A2:race +
## parentBMI:A2 + month4BMI:A2, data = data)
##
## Coefficients:
## (Intercept) parentBMI month4BMI A2MR A2CD:race A2MR:race parentBMI:A2MR month4BMI:A2MR
## 48.04514 -0.35715 -0.83827 -14.54904 -0.25018 1.64329 0.39890 0.02297
##
## Recommended Treatments:
## CD MR
## 98 112
##
## Estimated value: 7.54403
# you can extract the optimal treatment and decision function
optTx(fitSS)$optimalTx[1:6]
## [1] MR CD CD CD CD CD
## Levels: CD MR
optTx(fitSS)$decisionFunc[1:6,]
## CD MR
## [1,] 7.818515 8.553214
## [2,] 6.771519 5.095286
## [3,] 8.376248 6.696326
## [4,] 10.965522 8.130337
## [5,] 10.083604 8.743228
## [6,] 10.697010 8.576252
We then fit the model at the first stage. The only difference compared with the second stage is that we assign the fitted object from the previous stage, fitSS, to response, and the treatment assigned here is A1. We can again check that the result at this step matches our own implementation.
# First-stage Analysis
moMain <- buildModelObj(model = ~ parentBMI + baselineBMI,
                        solver.method = 'lm') # main effect
moCont <- buildModelObj(model = ~ race + parentBMI + baselineBMI,
                        solver.method = 'lm') # interaction effects
# fit the model at the first stage
fitFS <- qLearn(moMain = moMain, moCont = moCont,
                data = bmiData,
                response = fitSS,
                txName = 'A1')
## Step 2 of the Q-Learning Algorithm.
##
## Outcome regression.
## Combined outcome regression model: ~ parentBMI+baselineBMI + A1 + A1:(race+parentBMI+baselineBMI) .
## Regression analysis for Combined:
##
## Call:
## lm(formula = YinternalY ~ parentBMI + baselineBMI + A1 + A1:race +
## parentBMI:A1 + baselineBMI:A1, data = data)
##
## Coefficients:
## (Intercept) parentBMI baselineBMI A1MR A1CD:race A1MR:race parentBMI:A1MR baselineBMI:A1MR
## 36.73123 -0.08373 -0.68951 2.78926 0.59410 0.65941 -0.36938 0.23110
##
##
## Recommended Treatments:
## CD MR
## 121 89
##
## Estimated value: 8.374786
optTx(fitFS)$optimalTx[1:6]
## [1] CD MR MR MR MR MR
## Levels: CD MR
optTx(fitFS)$decisionFunc[1:6,]
## CD MR
## [1,] 9.967447 9.433248
## [2,] 8.482970 8.746729
## [3,] 8.794969 8.913004
## [4,] 9.119389 10.236224
## [5,] 11.088749 12.234857
## [6,] 8.981291 9.422195
summary(fitFS)
## $outcome
## $outcome$Combined
##
## Call:
## lm(formula = YinternalY ~ parentBMI + baselineBMI + A1 + A1:race +
## parentBMI:A1 + baselineBMI:A1, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.1295 -1.1803 -0.0718 1.3649 4.1127
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 36.73123 1.74405 21.061 < 2e-16 ***
## parentBMI -0.08373 0.03736 -2.241 0.02609 *
## baselineBMI -0.68951 0.04992 -13.811 < 2e-16 ***
## A1MR 2.78926 2.66905 1.045 0.29725
## A1CD:race 0.59410 0.36658 1.621 0.10666
## A1MR:race 0.65941 0.37014 1.782 0.07633 .
## parentBMI:A1MR -0.36938 0.04935 -7.484 2.17e-12 ***
## baselineBMI:A1MR 0.23110 0.07440 3.106 0.00217 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.851 on 202 degrees of freedom
## Multiple R-squared: 0.768, Adjusted R-squared: 0.7599
## F-statistic: 95.51 on 7 and 202 DF, p-value: < 2.2e-16
##
##
##
## $optTx
## CD MR
## 121 89
##
## $value
## [1] 8.374786
Similar to our previous discussion on removing the mean function, there is a related procedure called A-learning (advantage learning) that focuses on modeling the treatment contrast rather than the full Q-function. The idea is to decompose the Q-function into two parts: the mean function \(m_t(h_t)\), which captures the expected outcome under a reference treatment, and the advantage term \(a_t \, \mu_t(h_t)\), which captures the additional benefit of choosing treatment \(a_t\) over the reference. In the simplest case with a binary treatment \(A_t \in \{-1, 1\}\), we can write the Q-function as
\[ Q_t(h_t, a_t) = m_t(h_t) + a_t \, \mu_t(h_t). \]
The optimal decision rule at stage \(t\) depends only on the sign of \(\mu_t(h_t)\): \[ \pi_t^*(h_t) = \operatorname{sign}\big(\mu_t(h_t)\big). \]
In practice, the advantage function \(\mu_t(h_t)\) can be estimated using an orthogonalized estimating equation:
\[ \E\!\left[ (A_t - e_t(H_t)) \Big(Y - m_t(H_t) - (A_t - e_t(H_t))\, \mu_t(H_t)\Big) \right] = 0, \]
where \(e_t(H_t) = \E[A_t \mid H_t]\) plays the role of the propensity score under the \(\{-1, 1\}\) coding.
This estimating equation actually serves a very important purpose: the residual
\[ Y_{T-1}^* = Y - m_T(H_T) - (A_T - e_T(H_T))\,\mu_T(H_T) \]
is orthogonal to the treatment assignment \(A_{T-1}\), which removes the confounding effect from previous stages. This orthogonalization helps to isolate the treatment effect at each stage, leading to more robust estimation of the advantage function. At stage \(T-1\), the Bellman recursion optimizes
\[ \arg\max_{a_{T-1}} \E[V_T^*(H_T) \mid H_{T-1}, A_{T-1}=a_{T-1}], \]
where \(V_T^*(H_T) = \max_{a_T} Q_T(H_T, a_T)\). In A-learning, we replace \(V_T^*(H_T)\) with this mean-removed pseudo-outcome \(Y_{T-1}^*\), and because this residualization removes components uncorrelated with the current treatment, we have
\[ \arg\max_{a_{T-1}} \E[V_T^*(H_T)\mid H_{T-1},A_{T-1}] = \arg\max_{a_{T-1}} \E[Y_{T-1}^*\mid H_{T-1},A_{T-1}] \]
and hence the same optimal decision rule \(\pi_{T-1}^*\). A-learning thus preserves the same optimization objective as Q-learning, but expresses it through an orthogonalized outcome that isolates the treatment contrast.
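As a rough sketch of how the final-stage estimating equation can be solved on bmiData, the code below assumes a linear advantage model \(\mu_T(h_T) = x_T^\top \psi\), codes the treatment as \(A_T \in \{-1, 1\}\), takes the negative month-12 BMI as the outcome to be maximized, and plugs in working regression estimates of \(m_T\) and \(e_T\). With these plug-ins, the estimating equation reduces to an ordinary least-squares fit of the outcome residual on \((A_T - \hat e_T)\, x_T\); this is an illustrative sketch, not the package's implementation.
# A-learning at the final stage of bmiData: a minimal illustrative sketch
library(DynTxRegime)
data("bmiData")
Yneg <- -bmiData$month12BMI                   # outcome to maximize (assumption)
A2c  <- ifelse(bmiData$A2 == "MR", 1, -1)     # treatment coded as +1 / -1
X2   <- model.matrix(~ parentBMI + month4BMI, data = bmiData)
# Working models: m_T(H_T) = E[Y | H_T] and e_T(H_T) = E[A_T | H_T]
m_hat <- fitted(lm(Yneg ~ parentBMI + month4BMI, data = bmiData))
e_hat <- 2 * fitted(glm(as.numeric(A2c == 1) ~ parentBMI + month4BMI,
                        family = binomial, data = bmiData)) - 1
# OLS of the residual on (A_T - e_T) * x_T solves the plug-in estimating equation
Z <- (A2c - e_hat) * X2
psi_hat <- coef(lm(I(Yneg - m_hat) ~ Z - 1))
# Estimated advantage and the stage-T rule: recommend MR when mu_T(h_T) > 0
mu_hat <- drop(X2 %*% psi_hat)
table(ifelse(mu_hat > 0, "MR", "CD"))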
Starting from the G-formula (1), if we want to evaluate or optimize a new deterministic target policy \(\bpi = (\pi_1, \dots, \pi_T)\), its induced likelihood can be written as
\[ \underbrace{p(s_1) \, \One\{a_1 = \pi_1(s_1)\}}_{\text{initial stage}} \Bigg\{ \prod_{t=2}^T \underbrace{p(s_t \mid \bs_{t-1}, \ba_{t-1}) \, \One\{a_t = \pi_t(\bs_t, \ba_{t-1})\}}_{\text{transition and action at stage } t} \Bigg\} \underbrace{p( y \mid \bs_T, \ba_T)}_{\text{outcome model}}. \tag{2} \]
By taking the ratio of the target and observed likelihoods and applying the Radon–Nikodym theorem, the value function of the target policy under the observed data distribution can be written as:
\[ \cV_\bpi = \E_\bmu \!\left[ \frac{\prod_{t=1}^T \One\{ A_t = \pi_t(\bS_t, \bA_{t-1}) \}}{\prod_{t=1}^T \mu_t(A_t \mid \bS_t, \bA_{t-1})} \sum_{t=1}^T R_t \right], \] where the term inside the expectation corresponds to the Radon–Nikodym derivative weighting the observed outcome by the inverse of the behavior policy \(\bmu\). Here we assume additive stagewise rewards \(R_t\), so that the total reward is \(\sum_{t=1}^T R_t\). We also ignore the initial state distribution \(p(s_1)\) and consider the marginal population value.
For simplicity, we focus on a two-stage case (\(T = 2\)) and denote by \(H_t\) the observed history prior to stage \(t\), that is, \(H_1 = S_1\) and \(H_2 = (S_1, A_1, S_2)\).
Then the value function under policy \(\bpi\) can be written as
\[ \cV_\bpi = \E_\bmu \!\left[ \frac{ \One\{A_1 = \pi_1(H_1)\} \One\{A_2 = \pi_2(H_2)\} }{ \mu_1(A_1 \mid H_1) \mu_2(A_2 \mid H_2) } (R_1 + R_2) \right]. \tag{3} \]
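To make equation (3) concrete, the sketch below estimates the value of a simple, arbitrarily chosen two-stage regime on bmiData. It uses the same percent-BMI-change reward as above (so larger is better) and assumes both treatments were randomized with probability \(1/2\) at each stage, i.e., \(\mu_1 = \mu_2 = 0.5\); the thresholds defining the regime are purely illustrative.
library(DynTxRegime)
data("bmiData")
# Reward: negative percent BMI change from baseline (larger is better)
R <- -100 * (bmiData$month12BMI - bmiData$baselineBMI) / bmiData$baselineBMI
# A hypothetical target regime: thresholds chosen only for illustration
pi1 <- ifelse(bmiData$baselineBMI > 35, "MR", "CD")
pi2 <- ifelse(bmiData$month4BMI > 33, "MR", "CD")
# Assumed behavior policy: both stages randomized with probability 0.5
w <- (bmiData$A1 == pi1) * (bmiData$A2 == pi2) / (0.5 * 0.5)
# IPW estimate of the regime's value, compared with the observed mean reward
c(IPW_value = mean(w * R), observed_mean = mean(R))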
We now introduce two optimization methods from Zhao et al. (2015) based on this formulation.
One approach, termed simultaneous outcome weighted learning (SOWL) in Zhao et al. (2015), is to optimize equation (3) jointly over both stages. This can be viewed as a multi-category classification problem, since for binary actions at each stage there are four possible treatment paths. To approximate the discontinuous indicators, a surrogate loss is used, for example the concave hinge-type function \[ \psi(z_1, z_2) = \min(z_1, z_2, 1), \] where \(z_t = A_t \cdot f_t(H_t)\) for real-valued decision functions \(f_1, f_2\). The optimization problem can then be written as
\[ (\widehat f_1, \widehat f_2) = \argmax_{f_1, f_2} \frac{1}{n} \sum_{i=1}^n \frac{ \psi\!\big(A_{i1} f_1(H_{i1}), A_{i2} f_2(H_{i2})\big) }{ \widehat\mu_1(A_{i1} \mid H_{i1}) \, \widehat\mu_2(A_{i2} \mid H_{i2}) } (R_{i1} + R_{i2}) - \lambda \big(\|f_1\|^2 + \|f_2\|^2\big). \] The optimal decision rules are \(\hpi_1(H_1) = \operatorname{sign}\big(f_1(H_1)\big)\) and \(\hpi_2(H_2) = \operatorname{sign}\big(f_2(H_2)\big)\). Here each sample is weighted by \((R_{i1}+ R_{i2}) / \big[\widehat\mu_1(A_{i1} \mid H_{i1}) \widehat\mu_2(A_{i2} \mid H_{i2})\big]\).
Backward outcome weighted learning (BOWL), as the name suggests, proceeds by backward induction. Similar to Q-learning, we begin at the final stage \(T\), solving a single-stage OWL problem to obtain the estimated optimal decision rule \(\hpi_T(H_T)\). Then, moving backward, at each stage \(t\) we optimize \(\pi_t\) by maximizing the expected weighted cumulative reward from stage \(t\) onward:
\[ \argmax_{\pi_t} \E \!\left[ \frac{ \prod_{j=t+1}^T \One\{A_j = \hpi_j(H_j)\} \, \One\{A_t = \pi_t(H_t)\} }{ \prod_{j=t}^T \mu_j(A_j \mid H_j) } \sum_{j=t}^T R_j \right]. \]
Hence, the procedure is analogous to batch Q-learning, except that OWL avoids explicitly modeling the Q-function by directly optimizing the value expression.
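A rough sketch of this backward procedure on bmiData is given below. Each stage is a weighted classification problem; here a weighted logistic regression stands in for the weighted SVM used by Zhao et al. (2015), randomization probabilities are assumed to be 0.5 at both stages, and rewards are shifted to be nonnegative before being used as weights.
library(DynTxRegime)
data("bmiData")
R <- -100 * (bmiData$month12BMI - bmiData$baselineBMI) / bmiData$baselineBMI
# Stage 2: weighted classification with weights R / mu_2(A_2 | H_2)
w2 <- (R - min(R)) / 0.5                          # shift rewards to be nonnegative
fit2 <- glm(as.numeric(A2 == "MR") ~ parentBMI + month4BMI, data = bmiData,
            weights = w2, family = quasibinomial)
pi2_hat <- ifelse(predict(fit2, type = "response") > 0.5, "MR", "CD")
# Stage 1: keep subjects whose observed A2 agrees with pi2_hat, reweight by the
# product of randomization probabilities, and classify again
keep <- bmiData$A2 == pi2_hat
w1 <- ((R - min(R)) / (0.5 * 0.5))[keep]
fit1 <- glm(as.numeric(A1 == "MR") ~ parentBMI + baselineBMI, data = bmiData[keep, ],
            weights = w1, family = quasibinomial)
pi1_hat <- ifelse(predict(fit1, newdata = bmiData, type = "response") > 0.5, "MR", "CD")
table(stage1 = pi1_hat, stage2 = pi2_hat)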
In our previous formulation, we used the terminal outcome \(Y\) as the optimization target. A more general and often convenient representation is to decompose \(Y\) into stagewise rewards:
\[ Y = \sum_{t=1}^T R_t. \]
Under this formulation, we move toward the reinforcement learning setting, and the overall value of a policy \(\bpi\) can be written as
\[ V^\pi = \E_\pi\!\left[\sum_{t=1}^T R_t\right]. \]
The procedure for estimating the optimal dynamic treatment regime remains similar, with the Q-function at each stage defined as the expected cumulative reward from that stage onward. Starting from the last stage,
\[ Q_T(h_T, a_T) = \E[R_T \mid H_T = h_T, A_T = a_T], \quad V_T^\pi(h_T) = \E_{A_T \sim \pi_T(\cdot \mid h_T)}[Q_T(h_T, A_T)]. \]
And for earlier stages,
\[ Q_t(h_t, a_t) = \E\!\left[R_t + V_{t+1}^\pi(H_{t+1}) \mid H_t = h_t, A_t = a_t\right], \]
and
\[ V_t^\pi(h_t) = \E_{A_t \sim \pi_t(\cdot \mid h_t)}[Q_t(h_t, A_t)]. \]
The optimal regime is then defined as
\[ \begin{aligned} \pi_t^*(h_t) &= \arg\max_{a_t} Q_t^*(h_t, a_t) \\ &= \arg\max_{a_t} \E\!\left[R_t + \max_{a_{t+1}} Q_{t+1}^*(H_{t+1}, a_{t+1}) \mid H_t = h_t, A_t = a_t\right]. \end{aligned} \]
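The earlier bmiData implementation changes in only one place when intermediate rewards are recorded: the stage-\(t\) pseudo-outcome becomes \(R_t\) plus the estimated continuation value, rather than the continuation value alone. The sketch below illustrates this on simulated data; the generative model, variable names, and linear Q-function specifications are all hypothetical.
# Backward Q-learning with intermediate rewards on simulated two-stage data
set.seed(1)
n <- 500
S1 <- rnorm(n); A1 <- rbinom(n, 1, 0.5)
R1 <- S1 + A1 * (S1 > 0) + rnorm(n, sd = 0.5)    # immediate reward at stage 1
S2 <- 0.5 * S1 + 0.3 * A1 + rnorm(n)
A2 <- rbinom(n, 1, 0.5)
R2 <- S2 + A2 * (S2 < 0) + rnorm(n, sd = 0.5)    # immediate reward at stage 2
dat <- data.frame(S1, A1, R1, S2, A2, R2)
# Stage 2: regress R2 on history and action, then take the maximizing action
q2 <- lm(R2 ~ (S1 + A1 + S2) * A2, data = dat)
dat$V2 <- pmax(predict(q2, transform(dat, A2 = 1)),
               predict(q2, transform(dat, A2 = 0)))
# Stage 1: pseudo-outcome is R1 + V2, then repeat the same regression step
q1 <- lm(I(R1 + V2) ~ S1 * A1, data = dat)
q1_diff <- predict(q1, transform(dat, A1 = 1)) - predict(q1, transform(dat, A1 = 0))
dat$A1_opt <- as.numeric(q1_diff > 0)
table(stage1_optimal = dat$A1_opt)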