\(\newcommand{\ci}{\perp\!\!\!\perp}\) \(\newcommand{\cA}{\mathcal{A}}\) \(\newcommand{\cB}{\mathcal{B}}\) \(\newcommand{\cC}{\mathcal{C}}\) \(\newcommand{\cD}{\mathcal{D}}\) \(\newcommand{\cE}{\mathcal{E}}\) \(\newcommand{\cF}{\mathcal{F}}\) \(\newcommand{\cG}{\mathcal{G}}\) \(\newcommand{\cH}{\mathcal{H}}\) \(\newcommand{\cI}{\mathcal{I}}\) \(\newcommand{\cJ}{\mathcal{J}}\) \(\newcommand{\cK}{\mathcal{K}}\) \(\newcommand{\cL}{\mathcal{L}}\) \(\newcommand{\cM}{\mathcal{M}}\) \(\newcommand{\cN}{\mathcal{N}}\) \(\newcommand{\cO}{\mathcal{O}}\) \(\newcommand{\cP}{\mathcal{P}}\) \(\newcommand{\cQ}{\mathcal{Q}}\) \(\newcommand{\cR}{\mathcal{R}}\) \(\newcommand{\cS}{\mathcal{S}}\) \(\newcommand{\cT}{\mathcal{T}}\) \(\newcommand{\cU}{\mathcal{U}}\) \(\newcommand{\cV}{\mathcal{V}}\) \(\newcommand{\cW}{\mathcal{W}}\) \(\newcommand{\cX}{\mathcal{X}}\) \(\newcommand{\cY}{\mathcal{Y}}\) \(\newcommand{\cZ}{\mathcal{Z}}\) \(\newcommand{\bA}{\mathbf{A}}\) \(\newcommand{\bB}{\mathbf{B}}\) \(\newcommand{\bC}{\mathbf{C}}\) \(\newcommand{\bD}{\mathbf{D}}\) \(\newcommand{\bE}{\mathbf{E}}\) \(\newcommand{\bF}{\mathbf{F}}\) \(\newcommand{\bG}{\mathbf{G}}\) \(\newcommand{\bH}{\mathbf{H}}\) \(\newcommand{\bI}{\mathbf{I}}\) \(\newcommand{\bJ}{\mathbf{J}}\) \(\newcommand{\bK}{\mathbf{K}}\) \(\newcommand{\bL}{\mathbf{L}}\) \(\newcommand{\bM}{\mathbf{M}}\) \(\newcommand{\bN}{\mathbf{N}}\) \(\newcommand{\bO}{\mathbf{O}}\) \(\newcommand{\bP}{\mathbf{P}}\) \(\newcommand{\bQ}{\mathbf{Q}}\) \(\newcommand{\bR}{\mathbf{R}}\) \(\newcommand{\bS}{\mathbf{S}}\) \(\newcommand{\bT}{\mathbf{T}}\) \(\newcommand{\bU}{\mathbf{U}}\) \(\newcommand{\bV}{\mathbf{V}}\) \(\newcommand{\bW}{\mathbf{W}}\) \(\newcommand{\bX}{\mathbf{X}}\) \(\newcommand{\bY}{\mathbf{Y}}\) \(\newcommand{\bZ}{\mathbf{Z}}\) \(\newcommand{\ba}{\mathbf{a}}\) \(\newcommand{\bb}{\mathbf{b}}\) \(\newcommand{\bc}{\mathbf{c}}\) \(\newcommand{\bd}{\mathbf{d}}\) \(\newcommand{\be}{\mathbf{e}}\) \(\newcommand{\bg}{\mathbf{g}}\) \(\newcommand{\bh}{\mathbf{h}}\) \(\newcommand{\bi}{\mathbf{i}}\) \(\newcommand{\bj}{\mathbf{j}}\) \(\newcommand{\bk}{\mathbf{k}}\) \(\newcommand{\bl}{\mathbf{l}}\) \(\newcommand{\bm}{\mathbf{m}}\) \(\newcommand{\bn}{\mathbf{n}}\) \(\newcommand{\bo}{\mathbf{o}}\) \(\newcommand{\bp}{\mathbf{p}}\) \(\newcommand{\bq}{\mathbf{q}}\) \(\newcommand{\br}{\mathbf{r}}\) \(\newcommand{\bs}{\mathbf{s}}\) \(\newcommand{\bt}{\mathbf{t}}\) \(\newcommand{\bu}{\mathbf{u}}\) \(\newcommand{\bv}{\mathbf{v}}\) \(\newcommand{\bw}{\mathbf{w}}\) \(\newcommand{\bx}{\mathbf{x}}\) \(\newcommand{\by}{\mathbf{y}}\) \(\newcommand{\bz}{\mathbf{z}}\) \(\newcommand{\RR}{\mathbb{R}}\) \(\newcommand{\NN}{\mathbb{N}}\) \(\newcommand{\balpha}{\boldsymbol{\alpha}}\) \(\newcommand{\bbeta}{\boldsymbol{\beta}}\) \(\newcommand{\btheta}{\boldsymbol{\theta}}\) \(\newcommand{\hpi}{\widehat{\pi}}\) \(\newcommand{\bpi}{\boldsymbol{\pi}}\) \(\newcommand{\hbpi}{\widehat{\boldsymbol{\pi}}}\) \(\newcommand{\bxi}{\boldsymbol{\xi}}\) \(\newcommand{\bmu}{\boldsymbol{\mu}}\) \(\newcommand{\bepsilon}{\boldsymbol{\epsilon}}\) \(\newcommand{\bzero}{\mathbf{0}}\) \(\newcommand{\T}{\text{T}}\) \(\newcommand{\Trace}{\text{Trace}}\) \(\newcommand{\Cov}{\text{Cov}}\) \(\newcommand{\Var}{\text{Var}}\) \(\newcommand{\E}{\mathbb{E}}\) \(\newcommand{\Pr}{\text{Pr}}\) \(\newcommand{\pr}{\text{pr}}\) \(\newcommand{\pdf}{\text{pdf}}\) \(\newcommand{\P}{\text{P}}\) \(\newcommand{\p}{\text{p}}\) \(\newcommand{\One}{\mathbf{1}}\) \(\newcommand{\argmin}{\operatorname*{arg\,min}}\) 
\(\newcommand{\argmax}{\operatorname*{arg\,max}}\) \(\newcommand{\dtheta}{\frac{\partial}{\partial\theta} }\) \(\newcommand{\ptheta}{\nabla_\theta}\) \(\newcommand{\alert}[1]{\color{darkorange}{#1}}\) \(\newcommand{\alertr}[1]{\color{red}{#1}}\) \(\newcommand{\alertb}[1]{\color{blue}{#1}}\)

1 Overview

In our previous discussion of dynamic treatment regimes (DTRs), we introduced the Q-learning algorithm for estimating an optimal sequence of treatment decisions. In this lecture, we turn to the theoretical properties of Q-learning and analyze how estimation errors affect the quality of the learned policy. The main reference for this part is Murphy (2005), which provides one of the earliest theoretical analyses of Q-learning with function approximation.
While earlier work often assumed a tabular setting (i.e., exact representation of the Q-function), Murphy’s analysis explicitly considers finite-sample estimation and shows that errors in the estimated Q-functions can accumulate backward through the stages of dynamic programming. The key takeaway is that the generalization error of the learned dynamic treatment rule depends on the mean-squared deviation between the estimated and true Q-functions, and that estimation errors at later stages can be amplified when propagated backward through time.

2 A Recap on Q-learning for DTR

We observe \(n\) independent patient trajectories: \[ \{ (S_{i,1}, A_{i,1}, R_{i,1}, S_{i,2}, A_{i,2}, R_{i,2}, \dots, S_{i,T}, A_{i,T}, R_{i,T}) \}_{i = 1}^n. \]

At each decision stage \(t = 1, \dots, T\):

  • \(S_{i,t}\) is the state or covariate history summary of individual \(i\);
  • \(A_{i,t}\) is the action or treatment assigned at stage \(t\);
  • \(R_{i,t}\) is the immediate reward received after that action.

In DTRs, we do not make the Markov assumption. In general, each quantity may depend on the full past history

\[ H_{i,t} = (\bar{S}_{i,t}, \bar{A}_{i,t-1}) = (S_{i,1}, A_{i,1}, S_{i,2}, A_{i,2}, \dots, A_{i,t-1}, S_{i,t}). \]

The goal is to estimate a sequence of decision rules

\[ \bpi = (\pi_1, \pi_2, \dots, \pi_T), \]

where each \(\pi_t\) maps the available history \(H_t\) to a treatment decision \(A_t = \pi_t(H_t)\). At each stage \(t\), define the Q-function

\[ Q_t(H_t, A_t) = \E[ R_t + V_{t+1}(H_{t+1}) \mid H_t, A_t ], \quad V_t(H_t) = \max_{a_t} Q_t(H_t, a_t), \]

where \(V_t(H_t)\) represents the value of following the optimal policy from stage \(t\) onward, and we adopt the convention \(V_{T+1} \equiv 0\), so that \(Q_T(H_T, A_T) = \E[R_T \mid H_T, A_T]\). Since \(Q_t\) is typically unknown, we approximate it within a functional class \(\mathcal{F}_t\) at each stage.

  • Starting with the final stage \(t = T\), estimate \(Q_T^\ast\) by regressing the observed rewards \(R_T\) on the covariates \((H_T, A_T)\): \[ \widehat{Q}_T = \arg\min_{f \in \mathcal{F}_T} \frac{1}{n}\sum_{i=1}^n \big\{ R_{i,T} - f(H_{i,T}, A_{i,T}) \big\}^2, \] and define \[ \widehat{\pi}_T(h_T) = \arg\max_{a_T} \widehat{Q}_T(h_T, a_T). \]

  • For earlier stages \(t = T-1, \dots, 1\), first form the pseudo-outcome \[ Y_{i,t} = R_{i,t} + \max_{a_{t+1}} \widehat{Q}_{t+1}(H_{i,t+1}, a_{t+1}), \] and fit \[ \widehat{Q}_t = \arg\min_{f \in \mathcal{F}_t} \frac{1}{n}\sum_{i=1}^n \big\{ Y_{i,t} - f(H_{i,t}, A_{i,t}) \big\}^2. \] The estimated optimal rule at stage \(t\) is then \[ \widehat{\pi}_t(h_t) = \arg\max_{a_t} \widehat{Q}_t(h_t, a_t) \] (see the code sketch below for the full backward recursion).

After completing all stages, the learned dynamic treatment regime is \[ \widehat{\bpi} = (\widehat{\pi}_1, \widehat{\pi}_2, \dots, \widehat{\pi}_T). \]
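To make the recursion concrete, below is a minimal sketch of Q-learning with linear working models and binary actions. The data layout, the linear working model, and the use of the current state as the history summary are illustrative assumptions for this sketch, not part of Murphy (2005); in practice one would use richer history features \(H_{i,t}\) and possibly more flexible classes \(\cF_t\).

```python
import numpy as np

def fit_q_learning(S, A, R):
    """A minimal Q-learning sketch with linear working models and binary actions.

    S: list of length T with state arrays of shape (n, p_t)
    A: list of length T with action arrays of shape (n,), actions in {0, 1}
    R: list of length T with reward arrays of shape (n,)
    Returns the fitted coefficient vectors, one per stage.
    """
    T = len(S)
    n = S[0].shape[0]
    coefs = [None] * T
    pseudo = R[T - 1].copy()                      # at t = T the pseudo-outcome is just R_T

    for t in range(T - 1, -1, -1):
        # Working model: f(h_t, a_t) = [1, s_t, a_t, a_t * s_t] @ beta_t,
        # where the history is summarized by the current state, purely for illustration.
        X = np.column_stack([np.ones(n), S[t], A[t], A[t][:, None] * S[t]])
        beta, *_ = np.linalg.lstsq(X, pseudo, rcond=None)
        coefs[t] = beta

        if t > 0:
            # Pseudo-outcome for the previous stage: R_{t-1} + max_a Qhat_t(h_t, a).
            q = np.empty((n, 2))
            for a in (0, 1):
                Xa = np.column_stack([np.ones(n), S[t], np.full(n, a), a * S[t]])
                q[:, a] = Xa @ beta
            pseudo = R[t - 1] + q.max(axis=1)

    return coefs

def greedy_action(beta, s):
    """Estimated rule at one stage: pick the action with the larger fitted Q-value."""
    s = np.atleast_1d(s).astype(float)
    q = [np.concatenate([[1.0], s, [a], a * s]) @ beta for a in (0, 1)]
    return int(np.argmax(q))
```

The returned coefficients define \(\widehat{Q}_1, \dots, \widehat{Q}_T\), and applying `greedy_action` stage by stage gives the estimated regime \(\widehat{\bpi}\).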

In the next sections, we analyze how estimation errors in \(\widehat{Q}_1, \dots, \widehat{Q}_T\) translate into a difference between \(\cV^{\widehat{\bpi}}\) and \(\cV^{\bpi^\ast}\), and how these errors propagate backward through the recursion, leading to a generalization error bound.

3 A Regret Bound for Any Policy

The estimated policy will be of high quality if the Q-functions are estimated accurately; how to achieve accurate estimation is a separate topic that we return to later. First, however, we ask a more basic question: how well does a given policy perform compared with the optimal one, especially when it must be evaluated using the training data we collected? This can be formally analyzed by looking at the difference

\[ \cV^\ast(s_1) - \cV^\bpi(s_1) \]

for any \(\bpi\). In practice, we may think of this \(\bpi\) as the one that we estimated, e.g., \(\hat\bpi\), based on the training data. Following Murphy (2005), we can establish a bound on this difference as

\[ \cV^\ast(s_1) - \cV^\bpi(s_1) \leq \sum_{t = 1}^T 2\,L^{t/2} \sqrt{ \E_\mu \!\left[ \big( Q^\bpi_t(S_t, A_t) - Q^\ast_t(S_t, A_t) \big)^2 \;\big|\; S_1 = s_1 \right] }. \]

This result shows that the performance gap between an arbitrary policy \(\bpi\) and the optimal policy \(\bpi^\ast\) depends on an aggregated measure of how far the \(Q\)-functions associated with \(\bpi\) deviate from the optimal \(Q\)-functions at each stage. The error accumulates over time and can grow with the horizon \(T\) if \(Q^\bpi_t\) is not close to \(Q_t^\ast\). Before proving this, we clarify several quantities (a small numerical illustration follows the list):

  • The expectation \(\E_\mu\) is with respect to the behavior policy \(\mu\) that generates the training data. Hence, this can be useful when our Q-function estimates are based on the data collected under \(\mu\).
  • \(L\) is a constant such that \(\mu_t(a_t \mid \bar{S}_t, \bar{A}_{t-1}) \ge L^{-1}\) for any \(a_t\) and any history, ensuring all actions have positive probability under the behavior policy.
  • \(Q_t^\ast\) is the optimal Q-function at stage \(t\), associated with the optimal policy \(\bpi^\ast\). Hence, at time point \(t\), \(\pi_t^\ast(s_t) = \arg\max_{a_t} Q_t^\ast(s_t, a_t)\).
  • \(Q_t^\bpi\) is the true Q-function associated with policy \(\bpi\) (i.e., without estimation error). For example, if \(\bpi = \hat\bpi\) is the estimated policy, \(Q_t^\bpi\) represents the corresponding population version of the estimated \(Q\)-function.
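As a sanity check, the sketch below instantiates the bound in a made-up two-stage example with binary states and actions, deterministic transitions, and a uniform behavior policy (so \(L = 2\)). Here \(Q^\bpi_t\) stands in for the population version of an estimated Q-function: it is simply \(Q^\ast_t\) plus noise, with \(\bpi\) acting greedily with respect to it. All numerical values are illustrative assumptions.

```python
import numpy as np

# A made-up two-stage example (binary states and actions, deterministic
# transitions) in which both sides of the regret bound can be computed exactly.
rng = np.random.default_rng(0)
s1 = 0                      # fixed initial state s_1
L = 2.0                     # uniform behavior policy over two actions => mu >= 1/2 = 1/L

# Illustrative rewards r1[s1, a1], r2[s2, a2]; deterministic transition s2 = trans[s1, a1].
r1 = np.array([[1.0, 0.0], [0.5, 0.2]])
r2 = np.array([[0.0, 2.0], [1.0, 0.5]])
trans = np.array([[0, 1], [1, 0]])

# Optimal Q- and value functions by backward induction.
Q2_star = r2.copy()                      # Q_2^*(s2, a2) = E[R_2 | s2, a2]
V2_star = Q2_star.max(axis=1)            # V_2^*(s2)
Q1_star = r1 + V2_star[trans]            # Q_1^*(s1, a1) = r_1 + V_2^*(s2)
V1_star = Q1_star.max(axis=1)            # V_1^*(s1) = V^*(s1)

# "Q^pi": a perturbed copy of Q^* (standing in for the population version of an
# estimated Q-function), with pi acting greedily with respect to it.
Q2_pi = Q2_star + rng.normal(scale=0.8, size=Q2_star.shape)
Q1_pi = Q1_star + rng.normal(scale=0.8, size=Q1_star.shape)
pi1, pi2 = Q1_pi.argmax(axis=1), Q2_pi.argmax(axis=1)

# Value of pi from s1 (deterministic rollout) and the resulting regret.
a1 = pi1[s1]; s2 = trans[s1, a1]; a2 = pi2[s2]
regret = V1_star[s1] - (r1[s1, a1] + r2[s2, a2])

# Right-hand side: under the uniform behavior policy started at s1, A_1 is uniform
# and (S_2, A_2) is uniform over the pairs (trans[s1, a1], a2).
mse1 = np.mean((Q1_pi[s1] - Q1_star[s1]) ** 2)
mse2 = np.mean((Q2_pi[trans[s1]] - Q2_star[trans[s1]]) ** 2)
bound = 2 * L ** 0.5 * np.sqrt(mse1) + 2 * L * np.sqrt(mse2)

print(f"regret = {regret:.3f}, bound = {bound:.3f}")   # the bound always dominates the regret
```

Because the transitions are deterministic and the behavior policy is uniform, both sides of the bound can be computed exactly rather than by simulation.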

4 Telescoping Advantage Lemma

To prove the regret bound, we first establish a telescoping advantage lemma:

\[ \cV^\ast(s_1) - \cV^\bpi(s_1) = - \E_\bpi \!\left[ \sum_{t=1}^T \delta^\ast_t(\,\bar{s}_t,\, \bar{a}_t\,) \,\bigg|\, S_1 = s_1 \right], \]

where \(\delta^\bpi_t(\,\bar{s}_t,\, \bar{a}_t\,)\) is called the advantage function associated with any policy \(\bpi\), defined as

\[ \delta^\bpi_t(\bar{s}_t, \bar{a}_t) = Q_t^\bpi(\,\bar{s}_t,\, \bar{a}_{t-1},\, a_t\,) - \cV_t^\bpi(\,\bar{s}_t,\, \bar{a}_{t-1}\,), \quad \text{with } \bar{a}_t = (\bar{a}_{t-1}, a_t), \]

which measures how taking a particular action \(a_t\) at stage \(t\) (possibly different from the one prescribed by \(\pi_t\)), and then following \(\bpi\), performs relative to the policy’s own value at that stage. If the policy \(\bpi\) is derived by maximizing its corresponding \(Q\)-function, the advantage function is always non-positive, because the value function \(\cV_t^\bpi(\bar{s}_t, \bar{a}_{t-1})\) is then exactly the maximum of \(Q_t^\bpi(\bar{s}_t, \bar{a}_{t-1}, a_t)\) over all possible \(a_t\). Now we proceed with the proof. By the law of iterated expectations, we have

\[ \cV^\bpi(s_1) = \E_\bpi \!\left[ \sum_{t=1}^T R_t \,\bigg|\, S_1 = s_1 \right] = \E_\bpi \!\left[ \E_\bpi \!\left[ \sum_{t=1}^T R_t \,\Big|\, \bar{s}_T, \bar{a}_T \right] \,\bigg|\, S_1 = s_1 \right]. \]

Here, we further condition on the random variables \(\bar{s}_T, \bar{a}_T\) within the inner expectation. Once \(\bar{s}_T, \bar{a}_T\) are given, the distribution of the remaining trajectory is fixed and no longer depends on the policy. This allows us to replace the inner expectation under \(\E_\bpi\) by that under any other policy. The only randomness that remains is in \(R_T\), which depends on \(S_{T+1}\) after the final action \(A_T\) has been taken.
Hence, we have

\[ \begin{aligned} \E_\bpi \!\left[ \sum_{t=1}^T R_t \,\Big|\, \bar{s}_T, \bar{a}_T \right] &= \E_{\bpi^\ast} \!\left[ \sum_{t=1}^T R_t \,\Big|\, \bar{s}_T, \bar{a}_T \right] \\ &= \sum_{t=1}^{T-1} R_t + \E \!\left[ R_T \,\Big|\, \bar{s}_T, \bar{a}_T \right] \\ &= \sum_{t=1}^{T-1} R_t + Q_T^\ast(\bar{s}_T, \bar{a}_T). \end{aligned} \]

Now, we add and subtract a sequence of terms based on the optimal policy \(\bpi^\ast\) to construct a telescoping sum:

\[ \begin{aligned} \E_\bpi \!\left[ \sum_{t=1}^T R_t \,\Big|\, \bar{s}_T, \bar{a}_T \right] &= \sum_{t = 1}^T \Big[ Q_t^\ast(\bar{s}_t, \bar{a}_t) - \cV_t^\ast(\bar{s}_t, \bar{a}_{t-1}) \Big] \\ & \quad + \sum_{t=1}^{T-1} \Big[ R_t + \cV_{t+1}^\ast(\bar{s}_{t+1}, \bar{a}_t) - Q_t^\ast(\bar{s}_t, \bar{a}_t) \Big] \\ & \quad + \cV_1^\ast(s_1). \end{aligned} \]

The first term is the sum of the advantage functions \(\delta_t^\ast(\bar{s}_t, \bar{a}_t)\) of the optimal policy. Each summand in the second term has conditional mean zero given \((\bar{s}_t, \bar{a}_t)\), by the definition of \(Q_t^\ast\), and therefore vanishes once we take the outer expectation \(\E_\bpi[\,\cdot \mid S_1 = s_1]\) and apply iterated expectations. The last term is the value \(\cV^\ast(s_1)\) of the optimal policy \(\bpi^\ast\). Taking the outer expectation of both sides thus gives \(\cV^\bpi(s_1) = \E_\bpi\!\big[\sum_{t=1}^T \delta^\ast_t(\bar{S}_t, \bar{A}_t) \mid S_1 = s_1\big] + \cV^\ast(s_1)\), and rearranging establishes the telescoping advantage lemma.
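To make the cancellation concrete, here is the \(T = 2\) case of the decomposition written out: \[ \begin{aligned} &\Big[ Q_1^\ast(\bar{s}_1, \bar{a}_1) - \cV_1^\ast(s_1) \Big] + \Big[ Q_2^\ast(\bar{s}_2, \bar{a}_2) - \cV_2^\ast(\bar{s}_2, \bar{a}_1) \Big] \\ &\quad + \Big[ R_1 + \cV_2^\ast(\bar{s}_2, \bar{a}_1) - Q_1^\ast(\bar{s}_1, \bar{a}_1) \Big] + \cV_1^\ast(s_1) \;=\; R_1 + Q_2^\ast(\bar{s}_2, \bar{a}_2), \end{aligned} \] where \(Q_1^\ast\), \(\cV_1^\ast\), and \(\cV_2^\ast\) all cancel in pairs, leaving exactly the right-hand side of the conditional expectation above.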

5 Getting the Regret Bound

The expectation in the telescoping advantage lemma is taken with respect to the policy \(\bpi\), so along the trajectories appearing in it we have \(A_t = \pi_t(\bar{S}_t)\). We now assume that \(\bpi\) acts greedily with respect to its own \(Q_t^\bpi\), that is, \(\pi_t(\bar{s}_t) = \arg\max_{a_t} Q_t^\bpi(\bar{s}_t, \bar{a}_{t-1}, a_t)\). Under this policy, taking the action \(a_t = \pi_t(\bar{s}_t)\) at stage \(t\) makes the advantage exactly zero: \[ \begin{aligned} \delta^\bpi_t(\bar{s}_t, a_t) &= Q^\bpi_t(\bar{s}_t, \bar{a}_{t-1}, a_t=\pi_t(\bar{s}_t)) - \max_{a_t} Q^\bpi_t(\bar{s}_t, \bar{a}_{t-1}, a_t) \\ &= 0. \end{aligned} \]

Using this fact, we can rewrite the telescoping advantage equation as \[ \cV^\ast(s_1) - \cV^\bpi(s_1) = \sum_{t=1}^T \E_\bpi \!\left[ \delta^\bpi_t(\bar{S}_t, \bar{A}_t) - \delta^\ast_t(\bar{S}_t, \bar{A}_t) \,\bigg|\, S_1 = s_1 \right]. \]

We can now bound each term in the sum. Since the outer expectation is taken with respect to \(\bpi\), we have \(A_t = \pi_t(\bar{S}_t)\), so \(\delta^\bpi_t(\bar{S}_t, \bar{A}_t) = 0\) while \(\delta^\ast_t(\bar{S}_t, \bar{A}_t) \le 0\), which justifies the first equality below: \[ \begin{aligned} & \,\,\big|\delta^\bpi_t(\bar{S}_t, \bar{A}_t) - \delta^\ast_t(\bar{S}_t, \bar{A}_t)\big| \\ =& \,\, Q^\bpi_t(\bar{S}_t, \bar{A}_{t-1}, A_t=\pi_t(\bar{S}_t)) - \max_{a_t} Q^\bpi_t(\bar{S}_t, \bar{A}_{t-1}, a_t) \\ &\quad - \, Q^\ast_t(\bar{S}_t, \bar{A}_{t-1}, A_t=\pi_t(\bar{S}_t)) + \max_{a_t} Q^\ast_t(\bar{S}_t, \bar{A}_{t-1}, a_t) \\ \leq& \,\, \big| Q^\bpi_t(\bar{S}_t, \bar{A}_{t-1}, A_t=\pi_t(\bar{S}_t)) - Q^\ast_t(\bar{S}_t, \bar{A}_{t-1}, A_t=\pi_t(\bar{S}_t)) \big| \\ &\quad + \max_{a_t} \big| Q^\bpi_t(\bar{S}_t, \bar{A}_{t-1}, a_t) - Q^\ast_t(\bar{S}_t, \bar{A}_{t-1}, a_t) \big| \\ \leq& \,\, 2 \max_{a_t} \big| Q^\bpi_t(\bar{S}_t, \bar{A}_{t-1}, a_t) - Q^\ast_t(\bar{S}_t, \bar{A}_{t-1}, a_t) \big| \\ \leq& \,\, 2 L \sum_{a_t} \big| Q^\bpi_t(\bar{S}_t, \bar{A}_{t-1}, a_t) - Q^\ast_t(\bar{S}_t, \bar{A}_{t-1}, a_t) \big| \, \mu_t(a_t \mid \bar{S}_t, \bar{A}_{t-1}) \\ =& \,\, 2 L \, \E_{A_t \sim \mu_t(\cdot \mid \bar{S}_t, \bar{A}_{t-1})} \!\left[ \big| Q^\bpi_t(\bar{S}_t, \bar{A}_{t-1}, A_t) - Q^\ast_t(\bar{S}_t, \bar{A}_{t-1}, A_t) \big| \right], \end{aligned} \] where the first inequality bounds the difference of the maxima by the maximum of the differences, and the second-to-last step uses the positivity condition \(\mu_t(a_t \mid \bar{S}_t, \bar{A}_{t-1}) \ge L^{-1}\), i.e., \(1 \le L\,\mu_t(a_t \mid \bar{S}_t, \bar{A}_{t-1})\).

The remaining step is to handle the rest of the expectation, namely over \(A_1, \dots, A_{t-1}\), which still follow the policy \(\bpi\). Comparing the densities induced by the behavior policy \(\bmu\) and the target (estimated) policy \(\bpi\), the key idea is again a change of measure via the Radon–Nikodym derivative, as in our lecture on outcome weighted learning. Hence, we have

\[ \begin{aligned} & \,\, \E_\bpi \!\left[ \delta^\bpi_t(\bar{S}_t, \bar{A}_t) - \delta^\ast_t(\bar{S}_t, \bar{A}_t) \,\bigg|\, S_1 = s_1 \right] \\ \leq& \,\, 2L \, \E_{A_1, \ldots, A_{t-1} \sim \bpi} \bigg[ \E_{A_t \sim \bmu(\cdot \mid \bar{S}_t, \bar{A}_{t-1})} \Big[ \big| Q^\bpi_t(\bar{S}_t, \bar{A}_t) - Q^\ast_t(\bar{S}_t, \bar{A}_t) \big| \Big] \,\bigg|\, S_1 = s_1 \bigg] \\ =& \,\, 2L \, \E_\bmu \left[ \prod_{l = 1}^{t-1} \frac{\mathbf{1}\{A_l = \pi_l(\bar{S}_l)\}}{\mu_l(A_l \mid \bar{S}_l, \bar{A}_{l-1})} \, \big| Q^\bpi_t(\bar{S}_t, \bar{A}_{t-1}, A_t) - Q^\ast_t(\bar{S}_t, \bar{A}_{t-1}, A_t) \big| \, \bigg| S_1 = s_1 \right] \\ \leq& \,\, 2L \, \E_\bmu \left[ L^{\,t-1} \, \big| Q^\bpi_t(\bar{S}_t, \bar{A}_{t-1}, A_t) - Q^\ast_t(\bar{S}_t, \bar{A}_{t-1}, A_t) \big| \, \bigg| S_1 = s_1 \right] \\ =& \,\, 2 L^{\,t} \, \E_\bmu \left[ \big| Q^\bpi_t - Q^\ast_t \big| \, \bigg| S_1 = s_1 \right], \end{aligned} \] where the first inequality applies the per-stage bound just derived, the first equality performs the change of measure from \(\bpi\) to \(\bmu\), and the second inequality bounds each importance weight: the indicator is at most one and each propensity is at least \(L^{-1}\).

This implies that

\[ \cV^\ast(s_1) - \cV^\bpi(s_1) \leq \sum_{t = 1}^T 2\,L^{\,t} \, \E_\bmu \left[ \big| Q^\bpi_t - Q^\ast_t \big| \, \bigg| S_1 = s_1 \right]. \]

To obtain the mean-squared version in the stated bound, recall from the display above that \(\big|\delta^\bpi_t - \delta^\ast_t\big| \le 2 \max_{a_t} \big| Q^\bpi_t - Q^\ast_t \big|\), so that \(\cV^\ast(s_1) - \cV^\bpi(s_1) \le 2 \sum_{t=1}^T \E_\bpi\big[ \max_{a_t} | Q^\bpi_t - Q^\ast_t | \,\big|\, S_1 = s_1 \big]\). If we instead apply the Cauchy–Schwarz inequality to each term before changing the measure, we have (with a slight abuse of notation, writing \(Q_t(a_t)\) for \(Q_t(\bar{S}_t, \bar{A}_{t-1}, a_t)\))

\[ \begin{aligned} & \,\, \sum_{t = 1}^T \E_\bpi \bigg[ \max_{a_t} \big| Q^\bpi_t(a_t) - Q^\ast_t(a_t) \big| \,\bigg|\, S_1 = s_1 \bigg] \\ \leq& \,\, \sum_{t = 1}^T \sqrt{ \E_\bpi \bigg[ \max_{a_t} \big( Q^\bpi_t(a_t) - Q^\ast_t(a_t) \big)^2 \,\bigg|\, S_1 = s_1 \bigg] } \\ \leq& \,\, \sum_{t = 1}^T \sqrt{ \E_\bmu \bigg[ L^{\,t} \big( Q^\bpi_t - Q^\ast_t \big)^2 \,\bigg|\, S_1 = s_1 \bigg] } \\ =& \,\, \sum_{t = 1}^T L^{\,t/2} \sqrt{ \E_\bmu \bigg[ \big( Q^\bpi_t - Q^\ast_t \big)^2 \,\bigg|\, S_1 = s_1 \bigg] }. \end{aligned} \]

Multiplying by the factor of 2 carried over from the advantage-difference bound recovers the stated regret bound, which completes the proof.
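As an aside, the change-of-measure identity used in the proof can be checked quickly by simulation. The sketch below ignores states for simplicity; the deterministic target policy, the uniform behavior policy, and the test function are all made-up ingredients. It verifies that reweighting trajectories drawn under the behavior policy by the product of indicators over propensities recovers the expectation under the target policy.

```python
import numpy as np

# Monte Carlo sanity check of the change-of-measure step: for a deterministic
# target policy pi and a uniform behavior policy mu over {0, 1},
# E_pi[g(A_1, A_2)] = E_mu[ prod_l 1{A_l = pi_l} / mu_l(A_l) * g(A_1, A_2) ].
rng = np.random.default_rng(1)
n = 200_000
pi = np.array([1, 0])                            # deterministic actions at stages 1 and 2
g = lambda a: 3.0 * a[:, 0] + np.cos(a[:, 1])    # an arbitrary bounded test function

A_mu = rng.integers(0, 2, size=(n, 2))           # trajectories under mu (uniform, mu_l = 1/2)
weights = np.prod((A_mu == pi) / 0.5, axis=1)    # Radon-Nikodym weights, bounded by L^2 = 4
lhs = g(np.tile(pi, (n, 1))).mean()              # E_pi[g]: actions are fixed under pi
rhs = np.mean(weights * g(A_mu))                 # importance-weighted average under mu
print(f"E_pi[g] = {lhs:.3f}, weighted E_mu[g] = {rhs:.3f}")   # agree up to Monte Carlo error
```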

6 From Regret Bound to Generalization Error

The regret bound we established provides a population-level relationship between the performance gap \(\cV^\ast(s_1) - \cV^\bpi(s_1)\) and the mean-squared differences of the corresponding \(Q\)-functions under the behavior distribution. Murphy (2005) extends this result to a finite-sample generalization bound by recognizing that each estimated \(\widehat{Q}_t\) is fitted using pseudo-outcomes \[ Y_{i,t} = R_{i,t} + \max_{a_{t+1}} \widehat{Q}_{t+1}(H_{i,t+1}, a_{t+1}), \] which depend on the estimated \(\widehat{Q}_{t+1}\) from the next stage. As a result, estimation errors at later stages propagate backward even when the estimation at each individual stage is relatively accurate.

To translate the regret bound into a generalization bound, we replace the population expectations \(\E_\mu[(Q_t^\bpi - Q_t^\ast)^2]\) with their empirical counterparts based on the training data, and further decompose each term into an estimation error \(E_t\) (the statistical error from fitting within \(\cF_t\) on finite data) and an approximation error \(A_t\) (from \(\cF_t\) not containing the true \(Q\)-function). This yields a bound of the form \[ \cV^\ast(s_1) - \cV^{\widehat{\bpi}}(s_1) \;\lesssim\; \sum_{t=1}^T L^{t/2} \sqrt{E_t + A_t}, \] showing that both statistical and modeling errors at each stage contribute to the generalization error, with later-stage errors amplified by the recursive dependence of the pseudo-outcomes.
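One common way to make this decomposition concrete (a sketch based on standard arguments, not Murphy’s exact statement) is to introduce the best approximation within the class, \(\tilde{Q}_t = \argmin_{f \in \cF_t} \E_\mu[(f - Q_t^\ast)^2]\), and use the elementary inequality \((a+b)^2 \le 2a^2 + 2b^2\): \[ \E_\mu\!\big[(\widehat{Q}_t - Q_t^\ast)^2\big] \;\le\; 2\,\underbrace{\E_\mu\!\big[(\widehat{Q}_t - \tilde{Q}_t)^2\big]}_{\text{estimation error } E_t} \;+\; 2\,\underbrace{\E_\mu\!\big[(\tilde{Q}_t - Q_t^\ast)^2\big]}_{\text{approximation error } A_t}, \] where \(E_t\) shrinks with the sample size at a rate governed by the complexity of \(\cF_t\), while \(A_t\) reflects how well the class \(\cF_t\) can represent the true \(Q\)-function.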


Murphy, Susan A. 2005. “A Generalization Error for Q-Learning.” The Journal of Machine Learning Research 6: 1073–97.