In the previous lecture, we focused on causal inference and the estimation of optimal treatment rules in a single-stage setting, where methods such as inverse probability weighting, doubly robust estimation, and outcome weighted learning were used to identify the best static treatment decision. In this lecture, we extend these ideas to sequential decision making, where treatment decisions are made repeatedly over time in response to evolving patient information. This falls into the framework of dynamic treatment regimes (DTRs, Laber et al. (2014)), a central topic in precision medicine (Kosorok and Laber 2019). We will introduce the formulation of multi-stage decision making and present several approaches for estimating such dynamic decision rules, including the principle of backward induction and the Q-learning and A-learning algorithms. In the next lecture, we will explore some theoretical properties of these methods.
We consider a longitudinal study where each individual is observed over \(T\) decision stages. At each stage \(t = 1, \dots, T\), we collect a state (covariate) vector \(S_t \in \cS_t\) and assign a treatment \(A_t \in \cA_t\).
Oftentimes, for simplicity, we assume that \(\cS_t = \cS\) and \(\cA_t = \cA\) for all \(t\), meaning that the state and action spaces are the same across time. After running this study, we observe a terminal outcome \(Y\) at the end of stage \(T\). In some applications, we may observe intermediate rewards \(R_t\) at each stage \(t\), which can be used to define \(Y = \sum_{t=1}^T R_t\). However, in this traditional DTR setting, we only assume that the final outcome \(Y\) is observed at the end of stage \(T\). To record the history of states and actions along a trajectory, we define
\[ \bar{S}_t = (S_1, \dots, S_t), \quad \text{and} \,\, \bar{A}_t = (A_1, \dots, A_t) \]
The entire data for one subject is therefore
\[ \mathcal{O} = (\bar{S}_T, \bar{A}_T, Y) \]
and we observe \(n\) independent trajectories. A dynamic treatment regime (DTR) is a sequence of decision rules
\[ \bpi = (\pi_1, \pi_2, \dots, \pi_T), \]
where each \(\pi_t : \cH_t \to \cA_t\) is a function that assigns treatment \(A_t\) based on the current history \(H_t = (\bar{S}_t, \bar{A}_{t-1}) \in \cH_t\). The goal of a DTR is to optimize the expected terminal outcome, quantified by the value
\[ V(\bpi) = \E\big[ Y^{(\bpi)} \big], \]
where \(Y^{(\bpi)}\) denotes the potential outcome under the regime \(\bpi\); when intermediate rewards are available, \(Y = \sum_{t=1}^{T} R_t\) is the total reward. It should be noted that in practice we do not need to observe \(R_t\) at each stage, as long as we observe the final outcome \(Y\) at the end of stage \(T\). Our goal is to find an optimal dynamic treatment regime
\[ \bpi^* = \arg\max_{\bpi} V(\bpi). \]
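Before turning to estimation, it may help to see these objects concretely: a regime \(\bpi\) is just a sequence of functions of the accumulating history. The minimal R sketch below (with hypothetical states, rule forms, and a binary treatment coded as \(\pm 1\)) builds a two-stage regime and applies it to one subject's trajectory.
# A hypothetical two-stage regime: each element is a decision rule pi_t(H_t)
# mapping the history to a treatment in {-1, 1}; forms chosen only for illustration
pi_example <- list(
  pi1 = function(h) if (h$S1 > 0) 1 else -1,
  pi2 = function(h) if (h$S2 - 0.5 * h$S1 > 0) 1 else -1
)
# Apply the regime to one subject's evolving history
h1 <- list(S1 = 0.7)
a1 <- pi_example$pi1(h1)                  # stage-1 decision based on H_1 = S_1
h2 <- c(h1, list(A1 = a1, S2 = -0.2))     # history grows to H_2 = (S_1, A_1, S_2)
a2 <- pi_example$pi2(h2)                  # stage-2 decision based on H_2
c(A1 = a1, A2 = a2)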
To illustrate the above notation, we consider the two-stage clinical trial dataset bmiData from the DynTxRegime R package. This is an artificial dataset, but it mimics a study of the effect of meal replacement on adolescent obesity. At the baseline stage, gender, race, parentBMI, and baselineBMI were collected; at the second stage, month4BMI was collected. The decisions A1 and A2 indicate whether the participant received the meal replacement (MR) or control diet (CD) at each stage. The outcome month12BMI is collected at the end of stage 2.
The data flow can be visualized as
library(DynTxRegime)
## Loading required package: modelObj
data("bmiData")
# do a spaghetti plot colored by unique combinations of A1 and A2
# use red, blue, green and darkorange for different combinations
library(ggplot2)
bmiData$ID <- 1:nrow(bmiData)
bmiData_long <- reshape2::melt(bmiData, id.vars = c("ID", "A1", "A2"),
                               measure.vars = c("baselineBMI", "month4BMI", "month12BMI"),
                               variable.name = "Time", value.name = "BMI")
ggplot(bmiData_long, aes(x = Time, y = BMI, group = ID, color = interaction(A1, A2))) +
  geom_line(alpha = 0.5) +
  scale_color_manual(values = c("MR.MR" = "red", "MR.CD" = "blue", "CD.MR" = "green", "CD.CD" = "darkorange")) +
  labs(title = "BMI Trajectories by Treatment Combinations",
       x = "Time Point", y = "BMI", color = "Treatment (A1.A2)") +
  theme_minimal()
Similar to our previous causal inference setting, we require the consistency, sequential ignorability (no unmeasured confounding at each stage), and positivity assumptions to ensure identifiability of the optimal treatment regime. These are essentially generalizations of the standard causal assumptions to the longitudinal setting.
We can assume a data generating model similar to the single-stage setting, to calculate the expected outcome under a given (potentially stochastic) policy \(\pi\):
\[ \begin{aligned} p^{(\pi)}(s_1, a_1, \dots, s_T, a_T, y) &= \underbrace{p(s_1)}_{\text{initial state}} \;\underbrace{\pi_1(a_1 \mid s_1)}_{\text{policy at } t=1} \\ &\quad \times \prod_{t=2}^{T} \Big\{ \underbrace{p(s_t \mid \bar{s}_{t-1}, \bar{a}_{t-1})}_{\text{state transition}} \;\underbrace{\pi_t(a_t \mid \bar{s}_t, \bar{a}_{t-1})}_{\text{policy at } t} \Big\} \\ &\quad \times \underbrace{p(y \mid \bar{s}_T, \bar{a}_T)}_{\text{terminal outcome model}} . \end{aligned} \]
Under this model, the expected outcome under policy \(\bpi\) can be expressed as
\[ \begin{aligned} \E\!\left[ Y^{(\bpi)} \right] &= \int \E\!\left[ Y \mid \bar{S}_T = \bar{s}_T, \bar{A}_T = \bar{a}_T \right] \; p^{(\bpi)}(\bar{s}_T, \bar{a}_T) \; d\bar{s}_T\, d\bar{a}_T \\[6pt] &= \int \E\!\left[ Y \mid \bar{s}_T, \bar{a}_T \right] \; p(s_1) \;\pi_1(a_1 \mid s_1) \;\prod_{t=2}^{T} \Big\{ p(s_t \mid \bar{s}_{t-1}, \bar{a}_{t-1}) \;\pi_t(a_t \mid \bar{s}_t, \bar{a}_{t-1}) \Big\} \; d\bar{s}_T\, d\bar{a}_T . \end{aligned} \tag{1} \]
This is typically referred to as the G-formula (with “G” standing for “generalized”) in the causal inference literature (Murphy 2003).
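As a quick illustration of how the G-formula can be evaluated in practice: once the initial-state, transition, and outcome models are specified (or estimated), we can simulate trajectories forward under the target policy and average the outcome model. The generative model below is entirely hypothetical and serves only to show the mechanics for \(T = 2\).
# Monte Carlo evaluation of the G-formula under a hypothetical two-stage model
set.seed(1)
n_mc <- 1e5
s1 <- rnorm(n_mc, mean = 35, sd = 4)            # S1 ~ p(s1)
a1 <- as.numeric(s1 > 35)                       # deterministic pi_1(s1)
s2 <- 0.9 * s1 - 1.5 * a1 + rnorm(n_mc)         # S2 ~ p(s2 | s1, a1)
a2 <- as.numeric(s2 > 33)                       # deterministic pi_2(s1, a1, s2)
y_mean <- 0.8 * s2 - 2 * a2 - 0.5 * a1 * a2     # E[Y | s1, a1, s2, a2]
mean(y_mean)                                    # estimate of E[Y^(pi)]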
The G-formula provides a way to identify the value of any regime \(\bpi\) from the observed data. However, to find the optimal regime \(\bpi^*\), we need to optimize the expected outcome over all possible sequences of decision rules. Directly maximizing the G-formula is almost infeasible when \(T\) or the state space is large. Instead, we can take advantage of the sequential structure of the problem via Q-learning, which is based on the principles of dynamic programming and backward induction. The key insight is that the optimal treatment rule at each stage can be determined recursively, working backward from the final stage. Starting from the G-formula, we can decompose the expected outcome stage by stage.
Note that at the last stage, we have the following components in the density:
\[ \E[Y \mid \bar{s}_T, \bar{a}_T] \, \pi_T(a_T \mid \bar{s}_T, \bar{a}_{T-1}). \]
By slightly changing the notation, we can separate the last-stage information:
\[ h_T = (\bar{s}_T, \bar{a}_{T-1}), \]
and rewrite the above as
\[ \E[Y \mid h_T, a_T] \, \pi_T(a_T \mid h_T). \]
This is analogous to the single-stage causal inference problem, where \(h_T\) plays the role of covariates and \(a_T\) the treatment. This motivates defining an action-value function (or Q-function) at stage \(T\) as
\[ Q_T(h_T, a_T) = \E[Y \mid H_T = h_T, A_T = a_T]. \]
The policy-induced joint density can be factorized as
\[ p^{(\bpi)}(\bar{s}_T, \bar{a}_T) = \Big[ p^{(\bpi)}(\bar{s}_{T-1}, \bar{a}_{T-1}) \, p(s_T \mid h_{T-1}, a_{T-1}) \Big] \pi_T(a_T \mid h_T), \] so that \[ p^{(\bpi)}(h_T) = p^{(\bpi)}(\bar{s}_{T-1}, \bar{a}_{T-1}) \, p(s_T \mid h_{T-1}, a_{T-1}). \]
Substitute this into the G-formula:
\[ \begin{aligned} \E[Y^{(\bpi)}] &= \int \E[Y \mid h_T, a_T] \, \pi_T(a_T \mid h_T) \, p^{(\bpi)}(h_T) \, da_T \, dh_T \\[4pt] &= \E_{H_T}\!\Big[ \E_{A_T\sim\pi_T(\cdot\mid H_T)} \big[ Q_T(H_T, A_T) \big] \Big] \\[4pt] &= \E_{H_T}\!\left[ V_T^{\pi_T}(H_T) \right], \end{aligned} \] where \(V_T^{\pi_T}(h_T) = \E_{A_T\sim\pi_T(\cdot\mid h_T)} [ Q_T(h_T, A_T) ]\) is also called the value function at stage \(T\) under the final-stage policy \(\pi_T\).
Now, we can work backward to stage \(T-1\). The components in the G-formula involving stage \(T-1\) are
\[ V_T^{\pi}(h_T) \, p(s_T \mid h_{T-1}, a_{T-1}) \, \pi_{T-1}(a_{T-1} \mid h_{T-1}). \]
With the understanding that
\[ h_{T-1} = (\bar{s}_{T-1}, \bar{a}_{T-2}), \]
we can define the stage-\((T-1)\) Q-function as the conditional expectation of the continuation value \(V_T^{\pi}(H_T)\), given the history up to stage \(T-1\) and the current action \(A_{T-1}\):
\[ Q_{T-1}(h_{T-1}, a_{T-1}) = \E\!\left[ V_T^{\pi}(H_T) \mid H_{T-1} = h_{T-1}, A_{T-1} = a_{T-1} \right]. \]
Here, \(V_T^{\pi}(H_T)\) itself depends on the future history \(H_T = (H_{T-1}, A_{T-1}, S_T)\), but we do not model this dependence explicitly. Instead, the conditional expectation integrates over the distribution of \(S_T\) induced by \(p(s_T \mid h_{T-1}, a_{T-1})\). That is,
\[ Q_{T-1}(h_{T-1}, a_{T-1}) = \int V_T^{\pi}(h_T)\, p(s_T \mid h_{T-1}, a_{T-1})\, ds_T. \]
In practice, when estimating \(Q_{T-1}\), we do not need an explicit model for the transition distribution \(p(s_T \mid h_{T-1}, a_{T-1})\). We only need to treat the stage-\(T\) value \(V_T^{\pi}(H_T)\) as a pseudo-outcome observed for each individual and regress it on \((H_{T-1}, A_{T-1})\). This pseudo-outcome corresponds to the expected future outcome after taking action \(A_{T-1}\) given history \(H_{T-1}\). When optimizing the policy, this pseudo-outcome is taken to be the estimated optimal value \(\max_{a_T} \widehat Q_T(H_T, a_T)\), so that the optimal rule can be learned recursively at stage \(T-1\).
Substituting this definition back into the G-formula gives
\[ \E[Y^{(\bpi)}] = \E_{H_{T-1}}\!\Big[ \E_{A_{T-1}\sim\pi_{T-1}(\cdot\mid H_{T-1})} \big[ Q_{T-1}(H_{T-1}, A_{T-1}) \big] \Big] = \E_{H_{T-1}}\!\left[ V_{T-1}^{\pi}(H_{T-1}) \right], \] where \[ V_{T-1}^{\pi}(h_{T-1}) = \E_{A_{T-1}\sim\pi_{T-1}(\cdot\mid h_{T-1})} \big[ Q_{T-1}(h_{T-1}, A_{T-1}) \big]. \]
This process continues recursively for \(t = T-2, T-3, \dots, 1\), leading to the general recursion
\[ Q_t(h_t, a_t) = \E\!\left[ V_{t+1}^{\pi}(H_{t+1}) \mid H_t = h_t, A_t = a_t \right], \qquad V_t^{\pi}(h_t) = \E_{A_t\sim\pi_t(\cdot\mid h_t)} \big[ Q_t(h_t, A_t) \big]. \]
In the case of an optimal policy \(\bpi^*\), the expectation over \(\pi_t\) is replaced by maximization, leading to a Bellman-type optimality recursion:
\[ Q_t^*(h_t, a_t) = \E\!\left[ V_{t+1}^{*}(H_{t+1}) \mid H_t = h_t, A_t = a_t \right], \qquad V_t^{*}(h_t) = \max_{a_t \in \cA_t} Q_t^{*}(h_t, a_t), \]
with \(Q_T^*(h_T, a_T) = \E[Y \mid H_T = h_T, A_T = a_T]\) at the final stage, so that the optimal rule at each stage is \(\pi_t^*(h_t) = \arg\max_{a_t} Q_t^*(h_t, a_t)\).
We can implement the Q-learning algorithm on the bmiData dataset to estimate the optimal dynamic treatment regime for the two-stage meal replacement study. We use linear regression models to estimate the Q-functions at each stage. Note that this is a minimization problem rather than a maximization problem, since we want to minimize BMI.
library(DynTxRegime)
data("bmiData")
# Stage 2 Q-function estimation
bmiData$A2_num <- ifelse(bmiData$A2 == "MR", 1, 0)
stage2_Q <- lm(month12BMI ~ month4BMI*A2_num, data = bmiData)
# Predict Q2 values for both treatment options
bmiData$Q2_MR <- predict(stage2_Q, newdata = transform(bmiData, A2_num = 1))
bmiData$Q2_CD <- predict(stage2_Q, newdata = transform(bmiData, A2_num = 0))
# Optimal treatment at stage 2
bmiData$A2_opt <- ifelse(bmiData$Q2_MR < bmiData$Q2_CD, "MR", "CD")
# value of optimal treatment at stage 2
bmiData$V2_opt <- pmin(bmiData$Q2_MR, bmiData$Q2_CD)
# Predict Q1 values for stage 1 using V2_opt as outcome
bmiData$A1_num <- ifelse(bmiData$A1 == "MR", 1, 0)
stage1_Q <- lm(V2_opt ~ (baselineBMI + gender + race + parentBMI)*A1_num, data = bmiData)
# Predict Q1 values for both treatment options
bmiData$Q1_MR <- predict(stage1_Q, newdata = transform(bmiData, A1_num = 1))
bmiData$Q1_CD <- predict(stage1_Q, newdata = transform(bmiData, A1_num = 0))
# Optimal treatment at stage 1
bmiData$A1_opt <- ifelse(bmiData$Q1_MR < bmiData$Q1_CD, "MR", "CD")
# value of optimal treatment at stage 1
bmiData$V1_opt <- pmin(bmiData$Q1_MR, bmiData$Q1_CD)
# compare the density plot of observed and optimal treatments
library(ggplot2)
ggplot() +
  geom_density(data = bmiData, aes(x = month12BMI, fill = "Observed"), alpha = 0.4) +
  geom_density(data = bmiData, aes(x = V1_opt, fill = "Optimal"), alpha = 0.4) +
  scale_fill_manual(values = c("Observed" = "deepskyblue", "Optimal" = "darkorange")) +
  labs(title = "Density of Observed vs Optimal BMI at Month 12",
       x = "Month 12 BMI", y = "Density", fill = "Legend") +
  xlim(25, 50) +
  theme_minimal()
This can also be implemented using the qLearn function in the package. We will consider a slightly different setting, in which all the immediate rewards \(R_t\) are zero and we simply observe the final reward at the last stage.
\[ \text{Reward} = -100 \times \frac{\text{month12BMI}-\text{baselineBMI}}{\text{baselineBMI}}. \]
We need to specify models at the second stage using functions defined in the package.
moMain specifies the main-effect covariates (those without an interaction with the treatment).
moCont specifies the covariates that interact with the treatment at the current stage.
We then use the qLearn function to fit the model at the final stage. qLearn performs a single step of the Q-learning algorithm. moMain and moCont together determine the covariates of the model, response refers to the final reward, and txName refers to the name of the treatment variable assigned at this stage. We can see that the obtained result is the same as from our own model.
# define the aggregated reward over two stages
Y <- -100*(bmiData[,6L] - bmiData[,4L])/bmiData[,4L]
# Specify models at the second stage
moMain <- buildModelObj(model = ~ parentBMI + month4BMI,
                        solver.method = 'lm') # specify main effect
moCont <- buildModelObj(model = ~ race + parentBMI + month4BMI,
                        solver.method = 'lm') # specify interaction with treatment
# Second-Stage Analysis
fitSS <- qLearn(moMain = moMain, moCont = moCont,
                data = bmiData,
                response = Y,
                txName = 'A2')
## First step of the Q-Learning Algorithm.
##
## Outcome regression.
## Combined outcome regression model: ~ parentBMI+month4BMI + A2 + A2:(race+parentBMI+month4BMI) .
## Regression analysis for Combined:
##
## Call:
## lm(formula = YinternalY ~ parentBMI + month4BMI + A2 + A2:race +
## parentBMI:A2 + month4BMI:A2, data = data)
##
## Coefficients:
## (Intercept) parentBMI month4BMI A2MR A2CD:race A2MR:race parentBMI:A2MR month4BMI:A2MR
## 48.04514 -0.35715 -0.83827 -14.54904 -0.25018 1.64329 0.39890 0.02297
##
##
## Recommended Treatments:
## CD MR
## 98 112
##
## Estimated value: 7.54403
show(fitSS)
## Q-Learning: step 1
## Outcome Regression Analysis
## Combined
##
## Call:
## lm(formula = YinternalY ~ parentBMI + month4BMI + A2 + A2:race +
## parentBMI:A2 + month4BMI:A2, data = data)
##
## Coefficients:
## (Intercept) parentBMI month4BMI A2MR A2CD:race A2MR:race parentBMI:A2MR month4BMI:A2MR
## 48.04514 -0.35715 -0.83827 -14.54904 -0.25018 1.64329 0.39890 0.02297
##
## Recommended Treatments:
## CD MR
## 98 112
##
## Estimated value: 7.54403
# you can extract the optimal treatment and decision function
optTx(fitSS)$optimalTx[1:6]
## [1] MR CD CD CD CD CD
## Levels: CD MR
optTx(fitSS)$decisionFunc[1:6,]
## CD MR
## [1,] 7.818515 8.553214
## [2,] 6.771519 5.095286
## [3,] 8.376248 6.696326
## [4,] 10.965522 8.130337
## [5,] 10.083604 8.743228
## [6,] 10.697010 8.576252
We then fit the model at the first stage. The only difference compared with the second stage is that we assign the fitted object from the previous stage, fitSS, to response, and the treatment assigned here is A1. We can again check that the result at this step matches our own implementation.
# First-stage Analysis
moMain <- buildModelObj(model = ~ parentBMI + baselineBMI,
                        solver.method = 'lm') # main effect
moCont <- buildModelObj(model = ~ race + parentBMI + baselineBMI,
                        solver.method = 'lm') # interaction effects
# fit the model at the first stage
fitFS <- qLearn(moMain = moMain, moCont = moCont,
                data = bmiData,
                response = fitSS,
                txName = 'A1')
## Step 2 of the Q-Learning Algorithm.
##
## Outcome regression.
## Combined outcome regression model: ~ parentBMI+baselineBMI + A1 + A1:(race+parentBMI+baselineBMI) .
## Regression analysis for Combined:
##
## Call:
## lm(formula = YinternalY ~ parentBMI + baselineBMI + A1 + A1:race +
## parentBMI:A1 + baselineBMI:A1, data = data)
##
## Coefficients:
## (Intercept) parentBMI baselineBMI A1MR A1CD:race A1MR:race parentBMI:A1MR baselineBMI:A1MR
## 36.73123 -0.08373 -0.68951 2.78926 0.59410 0.65941 -0.36938 0.23110
##
##
## Recommended Treatments:
## CD MR
## 121 89
##
## Estimated value: 8.374786
optTx(fitFS)$optimalTx[1:6]
## [1] CD MR MR MR MR MR
## Levels: CD MR
optTx(fitFS)$decisionFunc[1:6,]
## CD MR
## [1,] 9.967447 9.433248
## [2,] 8.482970 8.746729
## [3,] 8.794969 8.913004
## [4,] 9.119389 10.236224
## [5,] 11.088749 12.234857
## [6,] 8.981291 9.422195
summary(fitFS)
## $outcome
## $outcome$Combined
##
## Call:
## lm(formula = YinternalY ~ parentBMI + baselineBMI + A1 + A1:race +
## parentBMI:A1 + baselineBMI:A1, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.1295 -1.1803 -0.0718 1.3649 4.1127
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 36.73123 1.74405 21.061 < 2e-16 ***
## parentBMI -0.08373 0.03736 -2.241 0.02609 *
## baselineBMI -0.68951 0.04992 -13.811 < 2e-16 ***
## A1MR 2.78926 2.66905 1.045 0.29725
## A1CD:race 0.59410 0.36658 1.621 0.10666
## A1MR:race 0.65941 0.37014 1.782 0.07633 .
## parentBMI:A1MR -0.36938 0.04935 -7.484 2.17e-12 ***
## baselineBMI:A1MR 0.23110 0.07440 3.106 0.00217 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.851 on 202 degrees of freedom
## Multiple R-squared: 0.768, Adjusted R-squared: 0.7599
## F-statistic: 95.51 on 7 and 202 DF, p-value: < 2.2e-16
##
##
##
## $optTx
## CD MR
## 121 89
##
## $value
## [1] 8.374786
Similar to our previous discussion on removing the mean function, there is a related procedure called A-learning (advantage learning) that focuses on modeling the treatment contrast rather than the full Q-function. The idea is to decompose the Q-function into two parts: the mean function \(m_t(h_t)\), which captures the expected outcome under a reference treatment, and the advantage term \(a_t \, \mu_t(h_t)\), which captures the additional benefit of choosing treatment \(a_t\) over the reference. In the simplest case with a binary treatment \(A_t \in \{-1, 1\}\), we can write the Q-function as
\[ Q_t(h_t, a_t) = m_t(h_t) + a_t \, \mu_t(h_t). \]
The optimal decision rule at stage \(t\) depends only on the sign of \(\mu_t(h_t)\): \[ \pi_t^*(h_t) = \operatorname{sign}\big(\mu_t(h_t)\big). \]
In practice, the advantage function \(\mu_t(h_t)\) can be estimated using an orthogonalized estimating equation:
\[ \E\!\left[ (A_t - e_t(H_t)) \Big(Y - m_t(H_t) - (A_t - e_t(H_t))\, \mu_t(H_t)\Big) \right] = 0, \]
where \(e_t(H_t) = \E[A_t \mid H_t]\) plays the role of the propensity score under the \(\{-1, 1\}\) coding.
This estimating equation actually serves a very important purpose: the residual
\[ Y_{T-1}^* = Y - m_T(H_T) - (A_T - e_T(H_T))\,\mu_T(H_T) \]
is orthogonal to the treatment assignment \(A_{T-1}\), which removes the confounding effect from previous stages. This orthogonalization helps to isolate the treatment effect at each stage, leading to more robust estimation of the advantage function. At stage \(T-1\), the Bellman recursion optimizes
\[ \arg\max_{a_{T-1}} \E[V_T^*(H_T) \mid H_{T-1}, A_{T-1}=a_{T-1}], \]
where \(V_T^*(H_T) = \max_{a_T} Q_T(H_T, a_T)\). In A-learning, we replace \(V_T^*(H_T)\) with this mean-removed pseudo-outcome \(Y_{T-1}^*\), and because this residualization removes components uncorrelated with the current treatment, we have
\[ \arg\max_{a_{T-1}} \E[V_T^*(H_T)\mid H_{T-1},A_{T-1}] = \arg\max_{a_{T-1}} \E[Y_{T-1}^*\mid H_{T-1},A_{T-1}] \]
and hence the same optimal decision rule \(\pi_{T-1}^*\). A-learning thus preserves the same optimization objective as Q-learning, but expresses it through an orthogonalized outcome that isolates the treatment contrast.
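As a rough sketch of how the final-stage estimating equation can be solved on bmiData, the code below assumes a linear advantage model \(\mu_T(h_T) = x_T^\top \psi\), codes the treatment as \(A_T \in \{-1, 1\}\), takes the negative month-12 BMI as the outcome to be maximized, and plugs in working regression estimates of \(m_T\) and \(e_T\). With these plug-ins, the estimating equation reduces to an ordinary least-squares fit of the outcome residual on \((A_T - \hat e_T)\, x_T\); this is an illustrative sketch, not the package's implementation.
# A-learning at the final stage of bmiData: a minimal illustrative sketch
library(DynTxRegime)
data("bmiData")
Yneg <- -bmiData$month12BMI                   # outcome to maximize (assumption)
A2c  <- ifelse(bmiData$A2 == "MR", 1, -1)     # treatment coded as +1 / -1
X2   <- model.matrix(~ parentBMI + month4BMI, data = bmiData)
# Working models: m_T(H_T) = E[Y | H_T] and e_T(H_T) = E[A_T | H_T]
m_hat <- fitted(lm(Yneg ~ parentBMI + month4BMI, data = bmiData))
e_hat <- 2 * fitted(glm(as.numeric(A2c == 1) ~ parentBMI + month4BMI,
                        family = binomial, data = bmiData)) - 1
# OLS of the residual on (A_T - e_T) * x_T solves the plug-in estimating equation
Z <- (A2c - e_hat) * X2
psi_hat <- coef(lm(I(Yneg - m_hat) ~ Z - 1))
# Estimated advantage and the stage-T rule: recommend MR when mu_T(h_T) > 0
mu_hat <- drop(X2 %*% psi_hat)
table(ifelse(mu_hat > 0, "MR", "CD"))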
Starting from the G-formula (1), if we want to evaluate or optimize a new deterministic target policy \(\bpi = (\pi_1, \dots, \pi_T)\), its induced likelihood can be written as
\[ \underbrace{p(s_1) \, \One\{a_1 = \pi_1(s_1)\}}_{\text{initial stage}} \Bigg\{ \prod_{t=2}^T \underbrace{p(s_t \mid \bs_{t-1}, \ba_{t-1}) \, \One\{a_t = \pi_t(\bs_t, \ba_{t-1})\}}_{\text{transition and action at stage } t} \Bigg\} \underbrace{p( y \mid \bs_T, \ba_T)}_{\text{outcome model}}. \tag{2} \]
By taking the ratio of the target and observed likelihoods and applying the Radon–Nikodym theorem, the value function of the target policy under the observed data distribution can be written as:
\[ \cV_\bpi = \E_\bmu \!\left[ \frac{\prod_{t=1}^T \One\{ A_t = \pi_t(\bS_t, \bA_{t-1}) \}}{\prod_{t=1}^T \mu_t(A_t \mid \bS_t, \bA_{t-1})} \sum_{t=1}^T R_t \right], \] where the term inside the expectation corresponds to the Radon–Nikodym derivative weighting the observed outcome by the inverse of the behavior policy \(\bmu\). Here we assume additive stagewise rewards \(R_t\), so that the total reward is \(\sum_{t=1}^T R_t\). We also ignore the initial state distribution \(p(s_1)\) and consider the marginal population value.
For simplicity, we focus on a two-stage case (\(T = 2\)) and denote by \(H_t\) the observed history prior to stage \(t\), that is, \(H_1 = S_1\) and \(H_2 = (S_1, A_1, S_2)\).
Then the value function under policy \(\bpi\) can be written as
\[ \cV_\bpi = \E_\bmu \!\left[ \frac{ \One\{A_1 = \pi_1(H_1)\} \One\{A_2 = \pi_2(H_2)\} }{ \mu_1(A_1 \mid H_1) \mu_2(A_2 \mid H_2) } (R_1 + R_2) \right]. \tag{3} \]
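To make equation (3) concrete, the sketch below estimates the value of a simple, arbitrarily chosen two-stage regime on bmiData. It uses the same percent-BMI-change reward as above (so larger is better) and assumes both treatments were randomized with probability \(1/2\) at each stage, i.e., \(\mu_1 = \mu_2 = 0.5\); the thresholds defining the regime are purely illustrative.
library(DynTxRegime)
data("bmiData")
# Reward: negative percent BMI change from baseline (larger is better)
R <- -100 * (bmiData$month12BMI - bmiData$baselineBMI) / bmiData$baselineBMI
# A hypothetical target regime: thresholds chosen only for illustration
pi1 <- ifelse(bmiData$baselineBMI > 35, "MR", "CD")
pi2 <- ifelse(bmiData$month4BMI > 33, "MR", "CD")
# Assumed behavior policy: both stages randomized with probability 0.5
w <- (bmiData$A1 == pi1) * (bmiData$A2 == pi2) / (0.5 * 0.5)
# IPW estimate of the regime's value, compared with the observed mean reward
c(IPW_value = mean(w * R), observed_mean = mean(R))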
We now introduce two optimization methods from Zhao et al. (2015) based on this formulation.
One approach, termed simultaneous outcome weighted learning (SOWL) in Zhao et al. (2015), is to optimize equation (3) jointly over both stages. This can be viewed as a multi-category classification problem, since for binary actions at each stage there are four possible treatment paths. To approximate the discontinuous indicators, a surrogate loss is used, for example the concave hinge-type function \[ \psi(z_1, z_2) = \min(z_1, z_2, 1), \] where \(z_t = A_t \cdot f_t(H_t)\) for real-valued decision functions \(f_1, f_2\). The optimization problem can then be written as
\[ (\widehat f_1, \widehat f_2) = \argmax_{f_1, f_2} \frac{1}{n} \sum_{i=1}^n \frac{ \psi\!\big(A_{i1} f_1(H_{i1}), A_{i2} f_2(H_{i2})\big) }{ \widehat\mu_1(A_{i1} \mid H_{i1}) \, \widehat\mu_2(A_{i2} \mid H_{i2}) } (R_{i1} + R_{i2}) - \lambda \big(\|f_1\|^2 + \|f_2\|^2\big). \] The optimal decision rules are \(\hpi_1(H_1) = \operatorname{sign}\big(f_1(H_1)\big)\) and \(\hpi_2(H_2) = \operatorname{sign}\big(f_2(H_2)\big)\). Here each sample is weighted by \((R_{i1}+ R_{i2}) / \big[\widehat\mu_1(A_{i1} \mid H_{i1}) \widehat\mu_2(A_{i2} \mid H_{i2})\big]\).
Backward outcome weighted learning (BOWL), as the name suggests, proceeds by backward induction. Similar to Q-learning, we begin at the final stage \(T\), solving a single-stage OWL problem to obtain the estimated optimal decision rule \(\hpi_T(H_T)\). Then, moving backward, at each stage \(t\) we optimize \(\pi_t\) by maximizing the expected weighted cumulative reward from stage \(t\) onward:
\[ \argmax_{\pi_t} \E \!\left[ \frac{ \prod_{j=t+1}^T \One\{A_j = \hpi_j(H_j)\} \, \One\{A_t = \pi_t(H_t)\} }{ \prod_{j=t}^T \mu_j(A_j \mid H_j) } \sum_{j=t}^T R_j \right]. \]
Hence, the procedure is analogous to batch Q-learning, except that OWL avoids explicitly modeling the Q-function by directly optimizing the value expression.
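A rough sketch of this backward procedure on bmiData is given below. Each stage is a weighted classification problem; here a weighted logistic regression stands in for the weighted SVM used by Zhao et al. (2015), randomization probabilities are assumed to be 0.5 at both stages, and rewards are shifted to be nonnegative before being used as weights.
library(DynTxRegime)
data("bmiData")
R <- -100 * (bmiData$month12BMI - bmiData$baselineBMI) / bmiData$baselineBMI
# Stage 2: weighted classification with weights R / mu_2(A_2 | H_2)
w2 <- (R - min(R)) / 0.5                          # shift rewards to be nonnegative
fit2 <- glm(as.numeric(A2 == "MR") ~ parentBMI + month4BMI, data = bmiData,
            weights = w2, family = quasibinomial)
pi2_hat <- ifelse(predict(fit2, type = "response") > 0.5, "MR", "CD")
# Stage 1: keep subjects whose observed A2 agrees with pi2_hat, reweight by the
# product of randomization probabilities, and classify again
keep <- bmiData$A2 == pi2_hat
w1 <- ((R - min(R)) / (0.5 * 0.5))[keep]
fit1 <- glm(as.numeric(A1 == "MR") ~ parentBMI + baselineBMI, data = bmiData[keep, ],
            weights = w1, family = quasibinomial)
pi1_hat <- ifelse(predict(fit1, newdata = bmiData, type = "response") > 0.5, "MR", "CD")
table(stage1 = pi1_hat, stage2 = pi2_hat)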
In our previous formulation, we used the terminal outcome \(Y\) as the optimization target. A more general and often convenient representation is to decompose \(Y\) into stagewise rewards:
\[ Y = \sum_{t=1}^T R_t. \]
Under this formulation, we move toward the reinforcement learning setting, and the overall value of a policy \(\bpi\) can be written as
\[ V^\pi = \E_\pi\!\left[\sum_{t=1}^T R_t\right]. \]
The procedure for estimating the optimal dynamic treatment regime remains similar, with the Q-function at each stage defined as the expected cumulative reward from that stage onward. Starting from the last stage,
\[ Q_T(h_T, a_T) = \E[R_T \mid H_T = h_T, A_T = a_T], \quad V_T^\pi(h_T) = \E_{A_T \sim \pi_T(\cdot \mid h_T)}[Q_T(h_T, A_T)]. \]
And for earlier stages,
\[ Q_t(h_t, a_t) = \E\!\left[R_t + V_{t+1}^\pi(H_{t+1}) \mid H_t = h_t, A_t = a_t\right], \]
and
\[ V_t^\pi(h_t) = \E_{A_t \sim \pi_t(\cdot \mid h_t)}[Q_t(h_t, A_t)]. \]
The optimal regime is then defined as
\[ \begin{aligned} \pi_t^*(h_t) &= \arg\max_{a_t} Q_t^*(h_t, a_t) \\ &= \arg\max_{a_t} \E\!\left[R_t + \max_{a_{t+1}} Q_{t+1}^*(H_{t+1}, a_{t+1}) \mid H_t = h_t, A_t = a_t\right]. \end{aligned} \]
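The earlier bmiData implementation changes in only one place when intermediate rewards are recorded: the stage-\(t\) pseudo-outcome becomes \(R_t\) plus the estimated continuation value, rather than the continuation value alone. The sketch below illustrates this on simulated data; the generative model, variable names, and linear Q-function specifications are all hypothetical.
# Backward Q-learning with intermediate rewards on simulated two-stage data
set.seed(1)
n <- 500
S1 <- rnorm(n); A1 <- rbinom(n, 1, 0.5)
R1 <- S1 + A1 * (S1 > 0) + rnorm(n, sd = 0.5)    # immediate reward at stage 1
S2 <- 0.5 * S1 + 0.3 * A1 + rnorm(n)
A2 <- rbinom(n, 1, 0.5)
R2 <- S2 + A2 * (S2 < 0) + rnorm(n, sd = 0.5)    # immediate reward at stage 2
dat <- data.frame(S1, A1, R1, S2, A2, R2)
# Stage 2: regress R2 on history and action, then take the maximizing action
q2 <- lm(R2 ~ (S1 + A1 + S2) * A2, data = dat)
dat$V2 <- pmax(predict(q2, transform(dat, A2 = 1)),
               predict(q2, transform(dat, A2 = 0)))
# Stage 1: pseudo-outcome is R1 + V2, then repeat the same regression step
q1 <- lm(I(R1 + V2) ~ S1 * A1, data = dat)
q1_diff <- predict(q1, transform(dat, A1 = 1)) - predict(q1, transform(dat, A1 = 0))
dat$A1_opt <- as.numeric(q1_diff > 0)
table(stage1_optimal = dat$A1_opt)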