We introduced the concept of dynamic treatment regimes (DTRs) in the previous section and used the Q-learning approach to solve for the optimal treatment regime. As in the one-stage problem, modeling the outcome (the Q-functions) can be challenging because it may involve the effects of many variables. Outcome weighted learning (Y. Zhao et al. 2012) can alleviate this issue by directly modeling the decision rule, which can be more efficient. Recall that the likelihood of the observed data is (if we ignore the randomness of \(R\) and simply assume that it is a deterministic function of \(A\) and \(S\)):
\[ \underbrace{\P(s_0) \mu_0(a_0 | s_0)}_{\text{initial stage}} \bigg\{ \prod_{t=1}^T \underbrace{\P_t(s_t | \bs_{t-1}, \ba_{t-1}) \mu_t(a_t | \bs_t, \ba_{t-1})}_{\text{transition and action at $t$}} \bigg\} \underbrace{\P_{T+1}( s_{T+1} | \bs_T, \ba_T)}_{\text{final stage}} \tag{1} \]
On the other hand, if we want to propose a deterministic target policy, the likelihood is
\[ \underbrace{\P(s_0) \One\{a_0 = \pi_0(s_0)\} }_{\text{initial stage}} \bigg\{ \prod_{t=1}^T \underbrace{\P_t(s_t | \bs_{t-1}, \ba_{t-1}) \One\{ a_t = \pi_t(\bs_t, \ba_{t-1}) \}}_{\text{transition and action at $t$}} \bigg\} \underbrace{\P_{T+1}( s_{T+1} | \bs_T, \ba_T)}_{\text{final stage}} \tag{2} \]
Taking the ratio of the two and using the Radon-Nikodym theorem, the value function of the target policy under the observed data distribution can be written as
\[ \cV_\bpi = \E_\bmu \left[ \frac{\prod_{t=1}^T \One\{ A_t = \pi_t(\bS_t, \bA_{t-1}) \}}{\prod_{t=1}^T \mu_t(A_t | \bS_t, \bA_{t-1})} \sum_{t=1}^T R_t \right]. \]
Here, we use \(\sum_{t=1}^T R_t\) as the overall outcome and weight it using the Radon-Nikodym derivative.\(^1\)
For this lecture, let’s consider a slightly simplified notation. First, we will start from time point \(t = 1\) rather than 0, and we will only consider two stages, so \(T = 2\). Second, let’s denote all the “history” at time point \(t\) as \(H_t\), which includes all the information up to time \(t\), before the action is made. Hence, for the initial stage, \(H_1 = S_1\), and for the second stage, \(H_2 = (\bS_2, \bA_1)\). With this simplification, the value function of policy \(\bpi\) using data observed from policy \(\bmu\) can be written as
\[ \cV_\bpi = \E_\bmu \left[ \frac{ \One\{ A_1 = \pi_1(H_1)\} \One\{ A_2 = \pi_2(H_2) \} }{\prod_{t=1}^2 \mu_t(A_t | H_t) } \Big(R_1 + R_2\Big) \right]. \tag{3} \]
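To make the weighting in equation (3) concrete, below is a minimal numerical sketch (not from the paper) that evaluates a candidate policy on simulated two-stage data. The data-generating mechanism, the 50/50 randomized behavior policy, and the target rules `pi1`, `pi2` are all hypothetical choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Behavior policy mu: randomize A_t in {-1, +1} with probability 1/2 each (hypothetical setup).
H1 = rng.normal(size=n)                        # stage-1 history (a single covariate here)
A1 = rng.choice([-1, 1], size=n)
R1 = H1 * A1 + rng.normal(size=n)              # stage-1 reward (made-up model)

S2 = 0.5 * H1 + 0.5 * A1 + rng.normal(size=n)  # new state, depends on (S_1, A_1)
H2 = S2                                        # stage-2 history is (S_1, S_2, A_1); only S_2 is used below
A2 = rng.choice([-1, 1], size=n)
R2 = H2 * A2 + rng.normal(size=n)              # stage-2 reward (made-up model)

mu1 = np.full(n, 0.5)                          # known randomization probabilities
mu2 = np.full(n, 0.5)

# A hypothetical deterministic target policy (pi_1, pi_2).
pi1 = np.sign(H1)
pi2 = np.sign(H2)

# Inverse-probability-weighted estimate of the value in equation (3).
match = (A1 == pi1) & (A2 == pi2)
weight = match / (mu1 * mu2)
V_hat = np.mean(weight * (R1 + R2))
print(f"estimated value of the target policy: {V_hat:.3f}")
```

Only subjects whose observed actions agree with the target policy at both stages contribute a nonzero term, and the propensities in the denominator correct for how likely it was to observe such a trajectory under \(\bmu\).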
We will introduce two methods proposed in Y.-Q. Zhao et al. (2015) based on this formulation.
Solving equation (3) directly is feasible: we can simply treat the problem as a multi-class classification problem. For example, if each stage has two treatment options, then we can treat the problem as a 4-class classification problem. However, this approach may not be efficient, especially when the number of actions is large. Instead, consider a soft version of the loss function, which mimics the hinge loss in the one-dimensional situation:
\[ \psi(z_1, z_2) = \min(z_1, z_2, 1) \]
The idea is that we take \(z_1 = A_1 \cdot f_1(H_1)\) and \(z_2 = A_2 \cdot f_2(H_2)\) for some real-valued functions \(f_1, f_2\). From the SVM point of view, recalling the relationship between the separation margin and the hinge loss, whenever \(z_1\) or \(z_2\) is less than 1, the observation is not at least one margin away from the separation boundary ( \(z = 0\) ). Furthermore, we incorporate a penalty on the norms of \(f_1\) and \(f_2\) to control their complexity. The sample version of the optimization problem becomes
\[ (\widehat f_1, \widehat f_2) = \argmax_{f_1, f_2} \frac{1}{n} \sum_{i=1}^n \left[ \frac{\psi\Big(A_{i1} \cdot f_1(H_{i1}), A_{i2} \cdot f_2(H_{i2})\Big)}{\prod_{t=1}^2 \widehat\mu_t(A_{it} | H_{it})} (R_{i1}+ R_{i2}) \right] - \lambda \left( \| f_1 \|^2 + \| f_2 \|^2 \right). \]
The estimated decision rules are then \(\hpi_1(H_1) = \text{sign}(\widehat f_1(H_1))\) and \(\hpi_2(H_2) = \text{sign}(\widehat f_2(H_2))\). This is again a weighted objective function, with \((R_{i1}+ R_{i2}) / \prod_{t=1}^2 \widehat\mu_t(A_{it} | H_{it})\) being the subject-specific weights.
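As a rough sketch of how this simultaneous optimization could be carried out, the code below uses linear decision functions \(f_t(h) = h^\top \beta_t + b_t\) and replaces the non-smooth \(\min\) with a smooth log-sum-exp approximation so that a generic gradient-based optimizer can be applied. Y.-Q. Zhao et al. (2015) instead solve an SVM-type dual problem; the simulated data, the smoothing parameter, and the reward shift used to keep the weights nonnegative are all illustrative choices, not part of the original method.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import logsumexp

rng = np.random.default_rng(1)
n = 2000

# Simulated two-stage data under a 50/50 randomized behavior policy (illustrative only).
H1 = rng.normal(size=(n, 2))
A1 = rng.choice([-1, 1], size=n)
H2 = np.column_stack([H1[:, 0] + 0.5 * A1, rng.normal(size=n)])
A2 = rng.choice([-1, 1], size=n)
R = H1[:, 0] * A1 + H2[:, 0] * A2 + rng.normal(size=n)   # R_1 + R_2 combined
mu_prod = np.full(n, 0.25)                               # mu_1 * mu_2 = 0.5 * 0.5

# Shift rewards to be nonnegative; for the indicator-based objective this only adds a
# constant and does not change which rule is the maximizer.
w = (R - R.min()) / mu_prod

def smooth_min(z, tau=0.1):
    # Smooth, concave approximation of the row-wise minimum: -tau * log sum exp(-z / tau).
    return -tau * logsumexp(-z / tau, axis=-1)

def neg_objective(theta, lam=0.01):
    # theta packs (beta_1, b_1, beta_2, b_2) for linear rules f_t(H_t) = H_t @ beta_t + b_t.
    beta1, b1, beta2, b2 = theta[:2], theta[2], theta[3:5], theta[5]
    z1 = A1 * (H1 @ beta1 + b1)
    z2 = A2 * (H2 @ beta2 + b2)
    surrogate = smooth_min(np.column_stack([z1, z2, np.ones(n)]))   # ~ min(z1, z2, 1)
    value = np.mean(w * surrogate) - lam * (beta1 @ beta1 + beta2 @ beta2)
    return -value                                                   # minimize the negative

res = minimize(neg_objective, x0=np.zeros(6), method="L-BFGS-B")
beta1_hat, b1_hat, beta2_hat, b2_hat = res.x[:2], res.x[2], res.x[3:5], res.x[5]
pi1_hat = np.sign(H1 @ beta1_hat + b1_hat)   # estimated stage-1 rule
pi2_hat = np.sign(H2 @ beta2_hat + b2_hat)   # estimated stage-2 rule
```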
As its name suggests, backward outcome weighted learning (BOWL) is a backward induction procedure. Similar to the batch Q-learning approach, we start from the last stage and solve for the best treatment rule at time \(T\). Note that the last stage involves only the outcome \(R_T\), given the history \(H_T\), so this is essentially a single-stage OWL problem. Once the last stage is optimized, we have the estimated optimal decision rule \(\hpi_T(H_T)\). We then back up one stage and solve for the optimal decision rule at time \(T-1\). In general, solving for the optimal decision rule at time \(t\) amounts to maximizing the value function over the choice of \(\pi_t\):
\[ \argmax_{\pi_t} \E \left[ \frac{ \prod_{j=t+1}^T \One\{A_j = \hpi_j(H_j) \} \cdot \One\{A_t = \pi_t(H_t) \} }{ \prod_{j=t}^T \mu_j(A_j | H_j) } \left( \sum_{j = t}^T R_j \right) \right] \]
Hence, the procedure is similar to batch Q-learning, except that we do not explicitly model the Q-function.
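Below is a minimal sketch of this backward procedure on simulated two-stage data, using a linear SVM with sample weights (scikit-learn's `LinearSVC`) in place of the weighted SVM solver in the paper. The simulated data, the reward shift that keeps the weights nonnegative, and the use of in-sample predictions are all simplifications for illustration.

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(2)
n = 2000

# Simulated two-stage data under a 50/50 randomized behavior policy (illustrative only).
H1 = rng.normal(size=(n, 2))
A1 = rng.choice([-1, 1], size=n)
R1 = H1[:, 0] * A1 + rng.normal(size=n)
H2 = np.column_stack([H1[:, 0] + 0.5 * A1, rng.normal(size=n)])
A2 = rng.choice([-1, 1], size=n)
R2 = H2[:, 0] * A2 + rng.normal(size=n)
mu1 = np.full(n, 0.5)
mu2 = np.full(n, 0.5)

def shift(r):
    # Shift rewards so the classification weights are nonnegative; this only adds a
    # constant to the objective and does not change the maximizing rule.
    return r - r.min() + 1e-3

# Stage 2: a single-stage OWL problem, weighting subjects by R_2 / mu_2(A_2 | H_2).
clf2 = LinearSVC().fit(H2, A2, sample_weight=shift(R2) / mu2)
pi2_hat = clf2.predict(H2)

# Stage 1: keep subjects whose observed A_2 follows the estimated stage-2 rule, and
# weight the cumulative reward by the product of both propensities.
keep = A2 == pi2_hat
clf1 = LinearSVC().fit(H1[keep], A1[keep],
                       sample_weight=shift(R1 + R2)[keep] / (mu1[keep] * mu2[keep]))
pi1_hat = clf1.predict(H1)
```

Note that the stage-1 fit uses only the subjects whose observed second-stage action agrees with \(\hpi_2\), which is exactly the role of the indicator \(\One\{A_2 = \hpi_2(H_2)\}\) in the objective above.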
The nice thing about the OWL approach is that it does not require explicit modeling of the Q-function, which can easily be mis-specified. By modeling only the decision rule, the OWL approach can also be more efficient, since the decision rule may involve only a small number of variables. Along the same line of thinking, there is another approach called Advantage Learning (A-learning; Murphy (2003); Schulte et al. (2014)). It aims at modeling the advantage function (see our previous lecture), which is the difference between the Q-function and the value function.
Another topic worth discussing is the propensity score. Recall that we needed the positivity assumption to estimate the average treatment effect, and the same is true here in OWL. If a certain treatment is never used at a given stage for some particular history value, the propensity score for that treatment is 0, and since it appears in the denominator, the weight would be infinite. Hence, the estimation of the propensity scores can greatly affect the performance of OWL.
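As a quick illustration of this practical point, the sketch below estimates a stage's propensity score with a working logistic regression and truncates (clips) the resulting inverse weights, a common stabilization device when estimated propensities get close to zero. The simulated data and the clipping threshold are arbitrary choices, not recommendations from the paper.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
n = 2000

# Observational-style data: treatment assignment depends on the history (illustrative only).
H1 = rng.normal(size=(n, 2))
p_true = 1 / (1 + np.exp(-2.5 * H1[:, 0]))           # true propensity of A_1 = +1
A1 = np.where(rng.uniform(size=n) < p_true, 1, -1)

# Estimate mu_1(A_1 | H_1) with a working logistic model.
ps_model = LogisticRegression().fit(H1, A1)
p_plus = ps_model.predict_proba(H1)[:, list(ps_model.classes_).index(1)]
mu1_hat = np.where(A1 == 1, p_plus, 1 - p_plus)

# Inverse-propensity weights blow up when mu1_hat is near zero (a near-violation of
# positivity); truncating them trades a little bias for a large reduction in variance.
raw_weight = 1 / mu1_hat
clipped_weight = np.clip(raw_weight, None, 20.0)     # arbitrary cap for illustration
print(raw_weight.max(), clipped_weight.max())
```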
1. Here we consider the case where we ignore the initial state \(S_0\) and simply take the expectation over it. Hence, this can be viewed as the value over the entire population, rather than over those who start from a particular \(S_0\).