\(\newcommand{\ci}{\perp\!\!\!\perp}\) \(\newcommand{\cA}{\mathcal{A}}\) \(\newcommand{\cB}{\mathcal{B}}\) \(\newcommand{\cC}{\mathcal{C}}\) \(\newcommand{\cD}{\mathcal{D}}\) \(\newcommand{\cE}{\mathcal{E}}\) \(\newcommand{\cF}{\mathcal{F}}\) \(\newcommand{\cG}{\mathcal{G}}\) \(\newcommand{\cH}{\mathcal{H}}\) \(\newcommand{\cI}{\mathcal{I}}\) \(\newcommand{\cJ}{\mathcal{J}}\) \(\newcommand{\cK}{\mathcal{K}}\) \(\newcommand{\cL}{\mathcal{L}}\) \(\newcommand{\cM}{\mathcal{M}}\) \(\newcommand{\cN}{\mathcal{N}}\) \(\newcommand{\cO}{\mathcal{O}}\) \(\newcommand{\cP}{\mathcal{P}}\) \(\newcommand{\cQ}{\mathcal{Q}}\) \(\newcommand{\cR}{\mathcal{R}}\) \(\newcommand{\cS}{\mathcal{S}}\) \(\newcommand{\cT}{\mathcal{T}}\) \(\newcommand{\cU}{\mathcal{U}}\) \(\newcommand{\cV}{\mathcal{V}}\) \(\newcommand{\cW}{\mathcal{W}}\) \(\newcommand{\cX}{\mathcal{X}}\) \(\newcommand{\cY}{\mathcal{Y}}\) \(\newcommand{\cZ}{\mathcal{Z}}\) \(\newcommand{\bA}{\mathbf{A}}\) \(\newcommand{\bB}{\mathbf{B}}\) \(\newcommand{\bC}{\mathbf{C}}\) \(\newcommand{\bD}{\mathbf{D}}\) \(\newcommand{\bE}{\mathbf{E}}\) \(\newcommand{\bF}{\mathbf{F}}\) \(\newcommand{\bG}{\mathbf{G}}\) \(\newcommand{\bH}{\mathbf{H}}\) \(\newcommand{\bI}{\mathbf{I}}\) \(\newcommand{\bJ}{\mathbf{J}}\) \(\newcommand{\bK}{\mathbf{K}}\) \(\newcommand{\bL}{\mathbf{L}}\) \(\newcommand{\bM}{\mathbf{M}}\) \(\newcommand{\bN}{\mathbf{N}}\) \(\newcommand{\bO}{\mathbf{O}}\) \(\newcommand{\bP}{\mathbf{P}}\) \(\newcommand{\bQ}{\mathbf{Q}}\) \(\newcommand{\bR}{\mathbf{R}}\) \(\newcommand{\bS}{\mathbf{S}}\) \(\newcommand{\bT}{\mathbf{T}}\) \(\newcommand{\bU}{\mathbf{U}}\) \(\newcommand{\bV}{\mathbf{V}}\) \(\newcommand{\bW}{\mathbf{W}}\) \(\newcommand{\bX}{\mathbf{X}}\) \(\newcommand{\bY}{\mathbf{Y}}\) \(\newcommand{\bZ}{\mathbf{Z}}\) \(\newcommand{\ba}{\mathbf{a}}\) \(\newcommand{\bb}{\mathbf{b}}\) \(\newcommand{\bc}{\mathbf{c}}\) \(\newcommand{\bd}{\mathbf{d}}\) \(\newcommand{\be}{\mathbf{e}}\) \(\newcommand{\bg}{\mathbf{g}}\) \(\newcommand{\bh}{\mathbf{h}}\) \(\newcommand{\bi}{\mathbf{i}}\) \(\newcommand{\bj}{\mathbf{j}}\) \(\newcommand{\bk}{\mathbf{k}}\) \(\newcommand{\bl}{\mathbf{l}}\) \(\newcommand{\bm}{\mathbf{m}}\) \(\newcommand{\bn}{\mathbf{n}}\) \(\newcommand{\bo}{\mathbf{o}}\) \(\newcommand{\bp}{\mathbf{p}}\) \(\newcommand{\bq}{\mathbf{q}}\) \(\newcommand{\br}{\mathbf{r}}\) \(\newcommand{\bs}{\mathbf{s}}\) \(\newcommand{\bt}{\mathbf{t}}\) \(\newcommand{\bu}{\mathbf{u}}\) \(\newcommand{\bv}{\mathbf{v}}\) \(\newcommand{\bw}{\mathbf{w}}\) \(\newcommand{\bx}{\mathbf{x}}\) \(\newcommand{\by}{\mathbf{y}}\) \(\newcommand{\bz}{\mathbf{z}}\) \(\newcommand{\balpha}{\boldsymbol{\alpha}}\) \(\newcommand{\bbeta}{\boldsymbol{\beta}}\) \(\newcommand{\btheta}{\boldsymbol{\theta}}\) \(\newcommand{\hpi}{\widehat{\pi}}\) \(\newcommand{\bpi}{\boldsymbol{\pi}}\) \(\newcommand{\hbpi}{\widehat{\boldsymbol{\pi}}}\) \(\newcommand{\bxi}{\boldsymbol{\xi}}\) \(\newcommand{\bmu}{\boldsymbol{\mu}}\) \(\newcommand{\bepsilon}{\boldsymbol{\epsilon}}\) \(\newcommand{\bzero}{\mathbf{0}}\) \(\newcommand{\T}{\text{T}}\) \(\newcommand{\Trace}{\text{Trace}}\) \(\newcommand{\Cov}{\text{Cov}}\) \(\newcommand{\Var}{\text{Var}}\) \(\newcommand{\E}{\mathbb{E}}\) \(\newcommand{\Pr}{\text{Pr}}\) \(\newcommand{\pr}{\text{pr}}\) \(\newcommand{\pdf}{\text{pdf}}\) \(\newcommand{\P}{\text{P}}\) \(\newcommand{\p}{\text{p}}\) \(\newcommand{\One}{\mathbf{1}}\) \(\newcommand{\argmin}{\operatorname*{arg\,min}}\) \(\newcommand{\argmax}{\operatorname*{arg\,max}}\)

Overview

Previously, we mainly focused on making inference about the treatment effect and the decision rule at a single decision time. However, many real-life applications require us to make a sequence of decisions over time. In the medical setting, this is known as a dynamic treatment regime (DTR). For example,

  • Cancer treatment often involves multiple stages, including surgery, chemotherapy, radiation, and immunotherapy. For example, in treating advanced non-small cell lung cancer (NSCLC), the chemotherapy plan may involve first- and second-line compounds (Zhao et al. 2011). After the first line of treatment, a patient may experience disease progression, and a second-line treatment will then be considered.

But of course the applications are not limited to the medical field. Dynamic treatment regimes find applications beyond healthcare, such as in agriculture for optimizing crop yields through adaptive interventions, in finance for tailoring investment strategies to evolving market conditions, and in education for customizing learning paths to students’ changing needs. In these settings, the sequential decision-making model utilizes time-varying information. We will focus on a specific setting where the system is non-stationary, non-Markovian, and has a finite number of decision points. We will introduce a method called Q-learning in the batch learning setting. This method uses the idea of dynamic programming to solve for the optimal treatment regime backwards. These terminologies and methods will be explained in the following. Most of the content in this note is based on Murphy (2005).

Sample Trajectory

The terminology of “batch” learning refers to the setting in which we already have a set of collected data. This is in contrast to the online learning setting, where we collect data as we go. Under this setting, we have a set of \(n\) i.i.d. observations, each composed of a sequence \(\{ S_0, A_0, R_0, S_1, A_1, R_1, \ldots, S_T, A_T, R_T, S_{T+1} \}\)1, where \(S_t\) is the state at time \(t\), \(A_t\) is the action taken at time \(t\), and \(R_t\) is the reward2 received after making the decision \(A_t\) at time \(t\). We assume that the number of decision points \(T\) is finite and that the action space is binary, i.e., \(A_t \in \{0, 1\}\).

Similar to our previous definition of the treatment effect, we can define the value function as the expected cumulative reward, given that we start at state \(S_0\) and take actions based on a certain policy rule. Formally, the value function is defined as

\[ \cV^\bpi(s) = \E_\bpi\left[ \sum_{t=0}^T R_t \bigg| S_0 = s \right] \tag{1} \]

But there is in fact a lot going on in this notation. It might be helpful to look at how a trajectory is generated. Here are some notations that we will use:

  • \(\bS_t\) is the history of all state information up to time \(t\), i.e., \(\bS_t = \{S_0, S_1, \cdots, S_t\}\).
  • \(\bA_t\) is the history of all action information up to time \(t\), i.e., \(\bA_t = \{A_0, A_1, \cdots, A_t\}\).

Then, at each state, we have the following mechanism:

  • Given history \(\bS_t\) and \(\bA_{t-1}\), we take an action \(A_t\) according to some (behavior) policy \(\mu_t(A_t | \bS_t, \bA_{t-1})\).
  • Given history \(\bS_t\) and \(\bA_{t-1}\), after taking action \(A_t\), the system moves to the next state \(S_{t+1}\) according to some transition model \(\P(S_{t+1} | \bS_t, \bA_t)\).
  • Given history \(\bS_t\), \(\bA_t\), and possibly3 the new state \(S_{t+1}\), we observe the reward \(R_t\) according to some reward model \(r(\bS_t, \bA_t, S_{t+1})\). A small simulation sketch of this mechanism is given below.
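To make this generating mechanism concrete, here is a minimal simulation sketch in Python. The behavior policy, transition model, and reward function below are hypothetical stand-ins (a coin-flip policy, a simple drift in the state, and a reward driven by the next state); they only illustrate how one trajectory \(\{S_0, A_0, R_0, S_1, \ldots, S_T, A_T, R_T, S_{T+1}\}\) is produced.

```python
import numpy as np

rng = np.random.default_rng(0)

def behavior_policy(state_hist, action_hist):
    # Hypothetical behavior policy mu_t: A_t = 1 with probability 0.5,
    # regardless of history (e.g., a randomized trial).
    return rng.binomial(1, 0.5)

def transition(state_hist, action_hist):
    # Hypothetical transition model P(S_{t+1} | history): the state drifts
    # upward when the most recent action is 1.
    return state_hist[-1] + 0.5 * action_hist[-1] + rng.normal(scale=0.1)

def reward(state_hist, action_hist, next_state):
    # Hypothetical reward model r(history, S_{t+1}).
    return next_state - 0.1 * action_hist[-1]

T = 3                      # decision points t = 0, ..., T
states = [rng.normal()]    # S_0
actions, rewards = [], []

for t in range(T + 1):
    a_t = behavior_policy(states, actions)           # A_t ~ mu_t(. | history)
    actions.append(a_t)
    s_next = transition(states, actions)             # S_{t+1} ~ P(. | history)
    rewards.append(reward(states, actions, s_next))  # R_t
    states.append(s_next)

# One sample trajectory: (S_t, A_t, R_t) for t = 0, ..., T, plus the final state S_{T+1}
print(list(zip(states[:-1], actions, rewards)), states[-1])
```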

Based on how the states and actions are generated, we can write down the likelihood of the observed trajectory, with \(\bs_t = \{s_0, s_1, \cdots, s_{t} \}\) and \(\ba_t = \{a_0, a_1, \cdots, a_t\}\) denoting the observed histories:

\[ \underbrace{\P(s_0) \mu_0(a_0 | s_0)}_{\text{initial stage}} \bigg\{ \prod_{t=1}^T \underbrace{\P_t(s_t | \bs_{t-1}, \ba_{t-1}) \mu_t(a_t | \bs_t, \ba_{t-1})}_{\text{transition and action at $t$}} \bigg\} \underbrace{\P_{T+1}( s_{T+1} | \bs_T, \ba_T)}_{\text{final stage}} \tag{2} \]

On the other hand, we may also propose our own policy \(\bpi\) which corresponds to a sequence of action rules \(\bpi = \{\pi_0, \pi_1, \cdots, \pi_T\}\). Then, the likelihood of the trajectory under the policy \(\bpi\) is

\[ \underbrace{\P(s_0) \pi_0(a_0 | s_0)}_{\text{initial stage}} \bigg\{ \prod_{t=1}^T \underbrace{\P_t(s_t | \bs_{t-1}, \ba_{t-1}) \pi_t(a_t | \bs_t, \ba_{t-1})}_{\text{transition and action at $t$}} \bigg\} \underbrace{\P_{T+1}( s_{T+1} | \bs_T, \ba_T)}_{\text{final stage}} \tag{3} \]

Recall that we discussed the difference between deterministic and stochastic policies in the one-time-point case with outcome weighted learning. The above equation is an example of a stochastic policy, meaning that each \(\pi_t\) is a probability distribution over the action space. For example, given the history \(\bS_t\) and \(\bA_{t-1}\), the policy \(\pi_t\) might be a Bernoulli distribution with a 90% chance of taking action 1 and a 10% chance of taking action 0. On the other hand, a deterministic policy always takes the same action given the same history. The likelihood for a deterministic policy sequence \(\bpi\) would be

\[ \underbrace{\P(s_0) \One\{a_0 = \pi_0(s_0)\} }_{\text{initial stage}} \bigg\{ \prod_{t=1}^T \underbrace{\P_t(s_t | \bs_{t-1}, \ba_{t-1}) \One\{ a_t = \pi_t(\bs_t, \ba_{t-1}) \}}_{\text{transition and action at $t$}} \bigg\} \underbrace{\P_{T+1}( s_{T+1} | \bs_T, \ba_T)}_{\text{final stage}} \tag{4} \]

The interpretation of the value function defined in (1) is the expected cumulative reward, given that we start at state \(S_0\), take actions based on the planned sequential policy \(\bpi\), and sum all the rewards observed. We need to realize that the difficulty here is that we do not observe samples generated from the policy \(\bpi\) that we are interested in; rather, we only have access to data generated under the behavior policy \(\bmu\). This is known as an off-line learning problem.

Value- and Q-Functions

The value function defined in (1) is a function of the state \(s\), but the expectation is taken over the randomness caused by the transitions \(\P_t(s_t | \bs_{t-1}, \ba_{t-1})\) and the policy \(\pi_t(a_t | \bs_t, \ba_{t-1})\). Our ultimate goal is to find the optimal policy \(\bpi^*\) that maximizes the value function, i.e.,

\[ \cV^\ast(s) = \max_{\bpi} \cV^\bpi(s) \tag{5} \]

To facilitate later arguments, we will introduce several new concepts. We can define the value at stage \(t\), given all the information available before making the decision at time \(t\):

\[ \cV^\bpi_t(\bs_t, \ba_{t-1}) = \E_\bpi\left[ \sum_{j=t}^T R_j \bigg| \bS_t = \bs_t, \bA_{t-1} = \ba_{t-1} \right] \tag{6} \]

Here we will use a little trick: separate \(\sum_{j=t}^T R_j\) into two parts, \(R_t + \sum_{j=t+1}^T R_j\). Then, we can re-write the \(\sum_{j=t+1}^T R_j\) part by further conditioning on the information available at the next stage, \(\bS_{t+1}\) and \(\bA_t\), so that it becomes the value function at the next stage. This is the same trick used in the Bellman equation (Sutton and Barto 2018).

\[ \begin{aligned} \cV^\bpi_t(\bs_t, \ba_{t-1}) &= \E_\bpi\left[ R_t + \sum_{j=t+1}^T R_j \bigg| \bS_t = \bs_t, \bA_{t-1} = \ba_{t-1} \right] \\ &= \E_\bpi\left[ R_t + \E_\bpi\left[\sum_{j=t+1}^T R_j \Big| \bS_{t+1}, \bA_t \right]\bigg| \bS_t = \bs_t, \bA_{t-1} = \ba_{t-1} \right] \\ &= \E_\bpi\left[ R_t + \cV^\bpi_{t+1}(\bS_{t+1}, \bA_t) \bigg| \bS_t = \bs_t, \bA_{t-1} = \ba_{t-1} \right] \end{aligned}\tag{7} \]

Note that this value function does not explicitly depend on the action \(a_t\) we take at stage \(t\). It integrates4 over all possible actions we can take at stage \(t\), according to the proposed policy \(\bpi\), as well as over all future values of the states and actions. On the other hand, if we explicitly specify the action \(a_t\), then this value function becomes the Q-function, also called the action-value function, for \(t = 0, 1, \ldots, T\)5:

\[ Q^\bpi_t(\bs_t, \ba_t) = \E_\bpi\left[ R_t + \cV^\bpi_{t+1}(\bS_{t+1}, \bA_t) \bigg| \bS_t = \bs_t, \bA_t = \ba_t \right] \tag{8} \]

The Bellman equation is a recursive equation that relates the value (or Q) function at stage \(t\) to the value function at stage \(t+1\). Recall that our goal is to find the optimal policy \(\bpi^*\) that maximizes the value function, i.e.,

\[ \begin{aligned} \cV^\ast_t(\bs_t, \ba_{t-1}) &= \max_{\bpi} \cV^\bpi_t(\bs_t, \ba_{t-1}) \\ &= \max_{a_t} \E \left[ R_t + \cV^\ast_{t+1}(\bS_{t+1}, \bA_t) \bigg| \bS_t = \bs_t, \bA_t = \ba_t \right] \end{aligned} \]

The interpretation of the second line is that following the optimal policy starting from time point \(t\) is the same as picking the action \(a_t\) at time \(t\) that maximizes the sum of the immediate reward \(R_t\) and the best possible cumulative outcome after \(t\), i.e., \(\cV^\ast_{t+1}\). Correspondingly, the optimal Q-function further specifies the action \(a_t\) at time \(t\) while ensuring that the rest of the trajectory is optimal:

\[ Q^\ast_t(\bs_t, \ba_t) = \E \left[ R_t + \cV^\ast_{t+1}(\bS_{t+1}, \bA_t) \bigg| \bS_t = \bs_t, \bA_t = \ba_t \right] \tag{9} \]

This has an important implication: if we want to find the optimal treatment rule at time \(t\), we can simply look at the action that maximizes the Q-function at time \(t\), provided that we achieve the best possible value \(\cV^\ast_{t+1}\) after time \(t\). Formally, this is represented as

\[ \pi_t^\ast(\bs_t, \ba_{t-1}) = \argmax_{a_t} Q^\ast_t(\bs_t, \ba_t) \]
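Since the action space in this note is binary, \(A_t \in \{0, 1\}\), this argmax is simply a comparison of two Q-values:

\[ \pi_t^\ast(\bs_t, \ba_{t-1}) = \One\big\{ Q^\ast_t(\bs_t, \ba_{t-1}, a_t = 1) > Q^\ast_t(\bs_t, \ba_{t-1}, a_t = 0) \big\} \]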

This allows us to develop a dynamic programming algorithm to solve the optimal treatment regime backwards.

Q-learning via Dynamic Programming

The simple idea of this batch Q-learning algorithm is to start from the last stage and move backward. Here is a brief summary of the idea:

  • Realizing that at the last stage there is no future dynamic to worry about, the problem is simply a regression of \(R_T\) on all available information at time \(T\).
  • Once we have learned that regression function, choosing the optimal policy \(\pi_T\) simply amounts to comparing all possible actions at the last stage based on it. Hence we know the best value we can achieve at that stage, i.e., \(\cV^\ast_{T}(\bS_{T}, \bA_{T-1})\).
  • Plugging this optimal value back into equation (9) for stage \(T-1\), we obtain a new regression problem that uses the outcome \(R_{T-1} + \cV^\ast_{T}\). This gives us the \(Q\)-function at stage \(T-1\).
  • Repeat this process until we reach the first stage, and the entire sequence of the best treatment policy is obtained.

Statistically, we usually specify some functional form of the Q-function. Say we have a linear model, indexed by a set of vector parameters \(\btheta = \{ \theta_0, \theta_1, \ldots, \theta_T \}\); we can first solve the last stage:

\[ \widehat\theta_T = \argmin_{\theta} \E_n \left[ R_T - \bQ_T(\bS_T, \bA_{T-1}, A_T; \theta) \right]^2 \]

where we can specify \(\bQ_T(\bS_T, \bA_{T-1}, A_T; \theta)\) as a linear regression function with covariates \(\bS_T\), \(\bA_{T-1}\), and \(A_T\). Then we can move backward one stage at a time, with the general formula that solves each \(\theta_t\), for \(t = T-1, \ldots, 0\), using

\[ \widehat\theta_t = \argmin_{\theta} \E_n \Big[ \underbrace{R_t + \max_{a_{t+1}} \bQ_{t+1}(\bS_{t+1}, \bA_t, A_{t+1} = a_{t+1}; \widehat\theta_{t+1})}_{\text{pseudo outcome at Stage $t$}} - \bQ_t(\bS_t, \bA_t; \theta) \Big]^2 \]
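Below is a minimal sketch of this backward-fitting procedure in Python (NumPy only). It assumes scalar states and a working linear Q-model with features \((S_t, A_t, S_t A_t)\) instead of the full history \((\bS_t, \bA_{t-1}, A_t)\); the function names, data layout, and feature choices are illustrative assumptions, not part of Murphy (2005).

```python
import numpy as np

def fit_linear_q(X, y):
    # Ordinary least squares for the working Q-model; returns (intercept, slopes).
    X1 = np.column_stack([np.ones(len(X)), X])
    theta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return theta

def q_value(theta, X):
    # Fitted Q-values for a design matrix X (one row per subject).
    X1 = np.column_stack([np.ones(len(X)), X])
    return X1 @ theta

def backward_q_learning(states, actions, rewards):
    """
    states  : (n, T+2) array of S_0, ..., S_{T+1} (scalar states for simplicity)
    actions : (n, T+1) array of binary A_0, ..., A_T
    rewards : (n, T+1) array of R_0, ..., R_T
    Returns the fitted parameters theta_0, ..., theta_T.
    """
    n, T_plus_1 = actions.shape
    T = T_plus_1 - 1
    thetas = [None] * (T + 1)
    pseudo = rewards[:, T].astype(float)   # stage-T outcome is just R_T

    for t in range(T, -1, -1):
        # Working features: (S_t, A_t, S_t * A_t); a richer model could use the full history.
        X = np.column_stack([states[:, t], actions[:, t], states[:, t] * actions[:, t]])
        thetas[t] = fit_linear_q(X, pseudo)

        if t > 0:
            # Pseudo-outcome for stage t-1: R_{t-1} + max_a Q_t(., a; theta_t)
            X0 = np.column_stack([states[:, t], np.zeros(n), np.zeros(n)])
            X1 = np.column_stack([states[:, t], np.ones(n), states[:, t]])
            v_hat = np.maximum(q_value(thetas[t], X0), q_value(thetas[t], X1))
            pseudo = rewards[:, t - 1] + v_hat

    return thetas

def pi_hat(theta_t, s):
    # Estimated rule at stage t: pick the action with the larger fitted Q-value.
    q0 = q_value(theta_t, np.array([[s, 0.0, 0.0]]))[0]
    q1 = q_value(theta_t, np.array([[s, 1.0, s]]))[0]
    return int(q1 > q0)
```

The helper `pi_hat` extracts the estimated rule at stage \(t\) by comparing the two fitted Q-values, which is the binary-action version of the argmax rule above.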

Generalization Error

The estimated policy would be of high quality if we estimate the Q-function accurately. However, how does it perform compared with the optimal policy, especially when it must be evaluated using the training data we collected? This can be formally analyzed by looking at

\[ \cV^\ast(s_0) - \cV^\bpi(s_0) \]

for any \(\bpi\). In practice, we may think of this \(\bpi\) as the one that we estimated, e.g., \(\hbpi\) based on the training data. Murphy (2005) established a bound on this difference:

Theorem 1 (Generalization Error) \[ \cV^\ast(s_0) - \cV^\bpi(s_0) \leq \sum_{t = 0}^T 2 L^{(t + 1)/2} \sqrt{ \E_\mu \left[ \Big( Q^\bpi_t(\bS_t, \bA_t) - Q^\ast_t(\bS_t, \bA_t) \Big)^2 \bigg| S_0 = s_0 \right] } \]

This suggests that the error accumulates over time, potentially getting worse as the trajectory gets longer, if the Q-function associated with our \(\bpi\) is not close to the optimal Q-function; a small numeric illustration is given after the following list. There are a few quantities we need to define:

  • The expectation \(\E_\mu\) is in terms of the distribution of the training data, i.e., the behavior policy \(\mu\) in Equation (2).
  • \(L\) is a constant that bounds the chance of observing any action at any time point from below, i.e., \(\mu_t(a_t | \bS_t, \bA_{t-1}) \geq L^{-1}\).
  • \(Q^\ast_t\) is the optimal Q-function at stage \(t\), associated with the optimal policy \(\bpi^\ast\). Clearly, \(\pi^\ast_t = \argmax_{a_t} Q_t^\ast(\bS_t, \bA_{t-1}, a_t)\) at stage \(t\).
  • \(Q^\bpi_t\) is the Q-function associated with the policy \(\bpi\) that we want to compare with. For example, if we have an estimated optimal policy \(\hbpi\) from the training data, then \(Q^\bpi_t\) is the Q-function that this policy maximizes6.
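As a rough numeric illustration of how this bound accumulates (all numbers below are made up), suppose every action is assigned with probability at least \(1/2\) under \(\bmu\), so we may take \(L = 2\), and suppose the root-mean-squared Q-function error is \(0.1\) at every stage:

```python
import numpy as np

L = 2.0                     # assumed: mu_t(a_t | history) >= 1/L = 0.5 (e.g., a randomized trial)
T = 5                       # assumed horizon: decision points t = 0, ..., T
rmse = np.full(T + 1, 0.1)  # assumed sqrt(E_mu[(Q^pi_t - Q*_t)^2 | S_0 = s_0]) at each stage

# Theorem 1: V*(s_0) - V^pi(s_0) <= sum_t 2 * L^((t+1)/2) * rmse_t
terms = [2 * L ** ((t + 1) / 2) * rmse[t] for t in range(T + 1)]
print([round(x, 3) for x in terms])   # later stages carry larger L^{(t+1)/2} factors
print(round(sum(terms), 3))           # the overall bound on the value difference
```

Even with a constant per-stage error, the \(L^{(t+1)/2}\) factor makes the later-stage terms dominate the bound.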

In the following, we will show a modified version of the proof provided in Murphy (2005), specifically adapted to comparing the optimal policy \(\bpi^\ast\) and our policy \(\bpi\). To establish the result, let’s first show the following:

Lemma 1 (Telescoping Advantage) \[ \cV^\ast(s_0) - \cV^\bpi(s_0) = - \E_\bpi \left[ \sum_{t=0}^T \delta^\ast_t(\bS_t, \bA_t) \bigg| S_0 = s_0 \right], \]

where \(\delta^\bpi_t(\bS_t, \bA_t)\) is called the advantage function associated with any policy \(\bpi\) defined as

\[ \delta^\bpi_t(\bs_t, \ba_t) = Q_t^\bpi(\bs_t, \ba_{t-1}, a_t) - \cV_t^\bpi(\bs_t, \ba_{t-1}), \tag{10} \]

which measures how good choosing a particular action \(a_t\) is, compared with the value obtained by following the policy \(\bpi\). Note that if a policy is derived from maximizing the Q-function, then the advantage function is always non-positive. For example, in the optimal policy case, no other action choice can be better than the optimal value. This is also true for our arbitrary policy \(\bpi\) (which may be estimated from the data), because we explicitly define the Q-function to be the one that this policy maximizes. One can think of this Q-function as the population version of the estimated Q-function from our data. But of course, that Q-function can be biased, which leads to a bad \(\bpi\) when we actually implement it.

Proof (Lemma: Telescoping Advantage). Note that

\[ \cV^\bpi(s_0) = \E_\bpi \left[ \sum_{t=0}^T R_t \bigg| S_0 = s_0 \right] = \E_\bpi \left[ \E_\bpi \left[ \sum_{t=0}^T R_t \Big| \bS_T, \bA_T \right] \bigg| S_0 = s_0 \right] \]

Here, we further condition on the random variables \(\bS_T, \bA_T\) within the conditional expectation. An interesting observation is that once \(\bS_T, \bA_T\) are given, the distribution of this trajectory is fixed and no longer depends on the policy. This allows us to re-write the inner expectation \(\E_\bpi\) as an expectation under any other policy. The only thing that remains uncertain is \(R_T\), which comes from observing \(S_{T+1}\). Again, this does not depend on the policy since we have already made our final action \(A_T\). Hence, we have

\[ \begin{aligned} \E_\bpi \left[ \sum_{t=0}^T R_t \Big| \bS_T, \bA_T \right] &= \E_{\bpi^\ast} \left[ \sum_{t=0}^T R_t \Big| \bS_T, \bA_T \right] \\ &= \sum_{t=0}^{T-1} R_t + \E_{S_{T+1}} \left[ R_T \Big| \bS_T, \bA_T \right] \\ &= \sum_{t=0}^{T-1} R_t + Q^\ast_T(\bS_T, \bA_T) \\ \end{aligned} \]

Now, we will add and subtract a sequence of terms in the above equation and obtain

\[ \begin{aligned} \E_\bpi \left[ \sum_{t=0}^T R_t \Big| \bS_T, \bA_T \right] &= \sum_{t = 0}^T \Big[ Q^\ast_t(\bS_t, \bA_t) - \cV^\ast_t(\bS_t, \bA_{t-1}) \Big]\\ & \quad + \sum_{t=0}^{T-1} \Big[ R_{t} + \cV^\ast_{t+1}(\bS_{t+1}, \bA_t) - Q^\ast_t(\bS_t, \bA_t) \Big] \\ & \quad + \cV^\ast_0(S_0) \end{aligned} \]

The second term is zero in expectation, by iterated expectations and the definition of the Q-function in (9). The first term is the sum of the advantage functions of the optimal policy, and the last term is the value function \(\cV^\ast(s_0)\) of the optimal policy \(\bpi^\ast\). Taking \(\E_\bpi[\, \cdot \mid S_0 = s_0]\) on both sides and rearranging yields the lemma.

This completes the proof. \(\square\)

Now, we prove the theorem.

Proof (Theorem: Generalization Error). Noticing that the expectation in the telescoping advantage equation is with respect to the policy \(\bpi\) that maximizes the corresponding \(Q^\bpi\), we realize that under this policy, if we take the action \(a_t = \pi_t(\bs_t, \ba_{t-1})\) at stage \(t\), then the advantage function is exactly 0, i.e.,

\[ \begin{aligned} \delta^\bpi_t(\bs_t, \ba_t) &= Q^\bpi_t(\bs_t, \ba_{t-1}, a_t = \pi_t(\bs_t, \ba_{t-1})) - \max_{a_t} Q^\bpi_t(\bs_t, \ba_{t-1}, a_t) \\ &= 0 \end{aligned} \]

Hence, we have

\[ \cV^\ast(s_0) - \cV^\bpi(s_0) = \sum_{t=0}^T \E_\bpi \left[ \delta^\bpi_t(\bS_t, \bA_t) - \delta^\ast_t(\bS_t, \bA_t) \bigg| S_0 = s_0 \right] \]

where we can bound each term as (keep in mind that the expectation is w.r.t \(\bpi\)):

\[ \begin{aligned} & \quad \,\, \big|\delta^\bpi_t(\bS_t, \bA_t) - \delta^\ast_t(\bS_t, \bA_t)\big| \\ &= Q^\bpi_t(\bS_t, \bA_{t-1}, A_t = \bpi_t) - \max_{a_t} Q^\bpi_t(\bS_t, \bA_{t-1}, a_t) - Q^\ast_t(\bS_t, \bA_{t-1}, A_t = \bpi_t) + \max_{a_t} Q^\ast_t(\bS_t, \bA_{t-1}, a_t) \\ &\leq \big| Q^\bpi_t(\bS_t, \bA_{t-1}, A_t = \bpi_t) - Q^\ast_t(\bS_t, \bA_{t-1}, A_t = \bpi_t) \big| + \max_{a_t} \big| Q^\bpi_t(\bS_t, \bA_{t-1}, a_t) - Q^\ast_t(\bS_t, \bA_{t-1}, a_t) \big| \\ &\leq 2 \times \max_{a_t} \big| Q^\bpi_t(\bS_t, \bA_{t-1}, a_t) - Q^\ast_t(\bS_t, \bA_{t-1}, a_t) \big| \\ &\leq 2 L \times \sum_{a_t} \big| Q^\bpi_t(\bS_t, \bA_{t-1}, a_t) - Q^\ast_t(\bS_t, \bA_{t-1}, a_t) \big| \cdot \mu(a_t | \bS_t, \bA_{t-1}) \\ &= 2 L \times \E_{A_t \sim \mu_t} \left[ \big| Q^\bpi_t(\bS_t, \bA_{t-1}, A_t) - Q^\ast_t(\bS_t, \bA_{t-1}, A_t) \big| \right] \end{aligned} \]

Note that the second-to-last \(\leq\) sign holds because we assumed \(\mu_t(a_t | \bS_t, \bA_{t-1}) \geq L^{-1}\), so that \(L \mu_t(a_t | \bS_t, \bA_{t-1}) \geq 1\); summing the differences between \(Q^\bpi_t\) and \(Q^\ast_t\) over all possible actions, weighted by \(L \mu_t\), is therefore greater than the maximum difference. The last step is to take care of the rest of the expectation, i.e., \(a_0\) to \(a_{t-1}\), which still follows the policy \(\bpi\). Comparing the density functions defined in (2) and (4), we will use the Radon–Nikodym derivative again (see the lecture on outcome weighted learning) and switch the expectation entirely to \(\E_\bmu\):

\[ \begin{aligned} & \quad \,\, 2L \sum_{t = 0}^T \E_{A_0, \ldots, A_{t-1} \sim \bpi} \bigg[ \E_{A_t \sim \bmu} \Big[ \big| Q^\bpi_t(\bS_t, \bA_t) - Q^\ast_t(\bS_t, \bA_t) \big| \Big] \bigg| S_0 = s_0 \bigg] \\ &= 2L \sum_{t = 0}^T \E_\bmu \left[ \prod_{l = 0}^{t-1} \frac{1\{A_l = \pi_l\}}{\mu_l(A_l | \bS_l, \bA_{l-1})} \big| Q^\bpi_t(\bS_t, \bA_{t-1}, A_t) - Q^\ast_t(\bS_t, \bA_{t-1}, A_t) \big| \bigg| S_0 = s_0 \right] \\ &\leq 2L \sum_{t = 0}^T \E_\bmu \left[ L^{t} \big| Q^\bpi_t(\bS_t, \bA_{t-1}, A_t) - Q^\ast_t(\bS_t, \bA_{t-1}, A_t) \big| \bigg| S_0 = s_0 \right] \\ &= 2 \sum_{t = 0}^T L^{t+1} \E_\bmu \left[ \big| Q^\bpi_t - Q^\ast_t \big| \bigg| S_0 = s_0 \right] \\ \end{aligned} \]

If we instead use the Cauchy–Schwarz inequality, we have (with a slight abuse of notation)

\[ \begin{aligned} & \quad \,\, \sum_{t = 0}^T \E_\bpi \bigg[ \max_{a_t} \big| Q^\bpi_t(a_t) - Q^\ast_t(a_t) \big| \bigg| S_0 = s_0 \bigg] \\ & \leq \sum_{t = 0}^T \sqrt{ \E_\bpi \bigg[ \max_{a_t} \big( Q^\bpi_t(a_t) - Q^\ast_t(a_t) \big)^2 \bigg| S_0 = s_0 \bigg] } \\ & \leq \sum_{t = 0}^T \sqrt{ \E_\bmu \bigg[ L^{t+1} \big( Q^\bpi_t - Q^\ast_t \big)^2 \bigg| S_0 = s_0 \bigg] } \\ &= \sum_{t = 0}^T L^{(t+1)/2} \sqrt{ \E_\bmu \bigg[ \big( Q^\bpi_t - Q^\ast_t \big)^2 \bigg| S_0 = s_0 \bigg] }. \end{aligned} \]

This completes the proof. \(\square\)

Discussion

The implication of the theorem is that we can (almost) evaluate the performance of our estimated policy \(\bpi\) against the optimal policy \(\bpi^\ast\), and if we evaluate them on the distribution of the training data, the difference grows as the trajectory gets longer and also grows if our corresponding Q-function is biased. This raises the question: when our Q-function is at risk of being biased (say, when using linear regression while the truth is not linear), and the trajectory is very long, is it still a good idea to use Q-learning?

The root cause of this trouble is that in Q-learning, different stages are viewed as almost independent, and the error in the Q-function at one stage can propagate to the next stage. This is especially true when the trajectory is long and the Q-function is estimated with high variance. This is a common issue in reinforcement learning. However, when we introduce additional assumptions, such as the time-homogeneous Markov assumption, we do not need the entire history \((\bS_t, \bA_t)\) to estimate the policy; instead, we only need the current state \(S_t\). Often, this falls into the category of the infinite-horizon MDP setting. Later methods such as Ertefaie and Strawderman (2018) and Luckett et al. (2020) were proposed to utilize these additional assumptions.

Still, the idea of Q-learning is very powerful and has been widely used in many applications. An important application is the Sequential, Multiple Assignment, Randomized Trial (SMART) design in clinical trials, where the goal is to find the optimal treatment regime for a chronic disease over multiple decision stages. The SMART design uses randomized treatment assignment at each stage (hence the constant \(L\) is small), and the Q-learning algorithm can be used to estimate the optimal treatment regime. This type of design has already been accepted by the FDA for drug development (Lei et al. 2012; Kidwell and Almirall 2023).

Reference

Ertefaie, Ashkan, and Robert L Strawderman. 2018. “Constructing Dynamic Treatment Regimes over Indefinite Time Horizons.” Biometrika 105 (4): 963–77.
Kidwell, Kelley M, and Daniel Almirall. 2023. “Sequential, Multiple Assignment, Randomized Trial Designs.” JAMA 329 (4): 336–37.
Lei, Huitan, Inbal Nahum-Shani, Kevin Lynch, David Oslin, and Susan A Murphy. 2012. “A ‘SMART’ Design for Building Individualized Treatment Sequences.” Annual Review of Clinical Psychology 8: 21–48.
Luckett, Daniel J, Eric B Laber, Anna R Kahkoska, David M Maahs, Elizabeth Mayer-Davis, and Michael R Kosorok. 2020. “Estimating Dynamic Treatment Regimes in Mobile Health Using v-Learning.” Journal of the American Statistical Association 115 (530): 692–706.
Murphy, Susan A. 2005. “A Generalization Error for q-Learning.” The Journal of Machine Learning Research 6: 1073–97.
Sutton, Richard S, and Andrew G Barto. 2018. Reinforcement Learning: An Introduction. MIT press.
Zhao, Yufan, Donglin Zeng, Mark A Socinski, and Michael R Kosorok. 2011. “Reinforcement Learning Strategies for Clinical Trials in Nonsmall Cell Lung Cancer.” Biometrics 67 (4): 1422–33.

  1. In many cases, \(S_{T+1}\) is not needed. This is mainly because some models assume that \(R_t\) is a function that involves \(S_{t+1}\). Please see the definition of reward.↩︎

  2. In some literature, the reward at stage \(t\) is denoted as \(R_{t+1}\), which is calculated based on the observed state \(S_{t+1}\). However, this is only a notation issue, and does not change the dynamic of this model.↩︎

  3. It is not always necessary that the reward depends on \(S_{t+1}\). It could be some random function based only on the history up to \(t\). In addition, the reward can be either deterministic or random. However, in the case of a random reward, the randomness given \(\bS_t, \bA_t, S_{t+1}\) has to be independent of all observed history; otherwise, we would be facing an unobserved confounder situation.↩︎

  4. This is more natural in the stochastic version, in which we take the expectation with respect to the distribution \(\bpi\). In the deterministic version, \(\bpi\) collapses into a single value.↩︎

  5. For \(t = T\) we can define \(\cV^\bpi_{T+1}\) to be 0 since it does not really exist.↩︎

  6. But this is not the one estimated from the training data, since the estimated \(\widehat Q_t\) would still contain errors. This \(Q^\bpi_t\) is an error-free version for which \(\hbpi\) is the corresponding optimal policy.↩︎