\(\newcommand{\ci}{\perp\!\!\!\perp}\) \(\newcommand{\cA}{\mathcal{A}}\) \(\newcommand{\cB}{\mathcal{B}}\) \(\newcommand{\cC}{\mathcal{C}}\) \(\newcommand{\cD}{\mathcal{D}}\) \(\newcommand{\cE}{\mathcal{E}}\) \(\newcommand{\cF}{\mathcal{F}}\) \(\newcommand{\cG}{\mathcal{G}}\) \(\newcommand{\cH}{\mathcal{H}}\) \(\newcommand{\cI}{\mathcal{I}}\) \(\newcommand{\cJ}{\mathcal{J}}\) \(\newcommand{\cK}{\mathcal{K}}\) \(\newcommand{\cL}{\mathcal{L}}\) \(\newcommand{\cM}{\mathcal{M}}\) \(\newcommand{\cN}{\mathcal{N}}\) \(\newcommand{\cO}{\mathcal{O}}\) \(\newcommand{\cP}{\mathcal{P}}\) \(\newcommand{\cQ}{\mathcal{Q}}\) \(\newcommand{\cR}{\mathcal{R}}\) \(\newcommand{\cS}{\mathcal{S}}\) \(\newcommand{\cT}{\mathcal{T}}\) \(\newcommand{\cU}{\mathcal{U}}\) \(\newcommand{\cV}{\mathcal{V}}\) \(\newcommand{\cW}{\mathcal{W}}\) \(\newcommand{\cX}{\mathcal{X}}\) \(\newcommand{\cY}{\mathcal{Y}}\) \(\newcommand{\cZ}{\mathcal{Z}}\) \(\newcommand{\bA}{\mathbf{A}}\) \(\newcommand{\bB}{\mathbf{B}}\) \(\newcommand{\bC}{\mathbf{C}}\) \(\newcommand{\bD}{\mathbf{D}}\) \(\newcommand{\bE}{\mathbf{E}}\) \(\newcommand{\bF}{\mathbf{F}}\) \(\newcommand{\bG}{\mathbf{G}}\) \(\newcommand{\bH}{\mathbf{H}}\) \(\newcommand{\bI}{\mathbf{I}}\) \(\newcommand{\bJ}{\mathbf{J}}\) \(\newcommand{\bK}{\mathbf{K}}\) \(\newcommand{\bL}{\mathbf{L}}\) \(\newcommand{\bM}{\mathbf{M}}\) \(\newcommand{\bN}{\mathbf{N}}\) \(\newcommand{\bO}{\mathbf{O}}\) \(\newcommand{\bP}{\mathbf{P}}\) \(\newcommand{\bQ}{\mathbf{Q}}\) \(\newcommand{\bR}{\mathbf{R}}\) \(\newcommand{\bS}{\mathbf{S}}\) \(\newcommand{\bT}{\mathbf{T}}\) \(\newcommand{\bU}{\mathbf{U}}\) \(\newcommand{\bV}{\mathbf{V}}\) \(\newcommand{\bW}{\mathbf{W}}\) \(\newcommand{\bX}{\mathbf{X}}\) \(\newcommand{\bY}{\mathbf{Y}}\) \(\newcommand{\bZ}{\mathbf{Z}}\) \(\newcommand{\ba}{\mathbf{a}}\) \(\newcommand{\bb}{\mathbf{b}}\) \(\newcommand{\bc}{\mathbf{c}}\) \(\newcommand{\bd}{\mathbf{d}}\) \(\newcommand{\be}{\mathbf{e}}\) \(\newcommand{\bg}{\mathbf{g}}\) \(\newcommand{\bh}{\mathbf{h}}\) \(\newcommand{\bi}{\mathbf{i}}\) \(\newcommand{\bj}{\mathbf{j}}\) \(\newcommand{\bk}{\mathbf{k}}\) \(\newcommand{\bl}{\mathbf{l}}\) \(\newcommand{\bm}{\mathbf{m}}\) \(\newcommand{\bn}{\mathbf{n}}\) \(\newcommand{\bo}{\mathbf{o}}\) \(\newcommand{\bp}{\mathbf{p}}\) \(\newcommand{\bq}{\mathbf{q}}\) \(\newcommand{\br}{\mathbf{r}}\) \(\newcommand{\bs}{\mathbf{s}}\) \(\newcommand{\bt}{\mathbf{t}}\) \(\newcommand{\bu}{\mathbf{u}}\) \(\newcommand{\bv}{\mathbf{v}}\) \(\newcommand{\bw}{\mathbf{w}}\) \(\newcommand{\bx}{\mathbf{x}}\) \(\newcommand{\by}{\mathbf{y}}\) \(\newcommand{\bz}{\mathbf{z}}\) \(\newcommand{\RR}{\mathbb{R}}\) \(\newcommand{\NN}{\mathbb{N}}\) \(\newcommand{\balpha}{\boldsymbol{\alpha}}\) \(\newcommand{\bbeta}{\boldsymbol{\beta}}\) \(\newcommand{\btheta}{\boldsymbol{\theta}}\) \(\newcommand{\hpi}{\widehat{\pi}}\) \(\newcommand{\bpi}{\boldsymbol{\pi}}\) \(\newcommand{\hbpi}{\widehat{\boldsymbol{\pi}}}\) \(\newcommand{\bxi}{\boldsymbol{\xi}}\) \(\newcommand{\bmu}{\boldsymbol{\mu}}\) \(\newcommand{\bepsilon}{\boldsymbol{\epsilon}}\) \(\newcommand{\bzero}{\mathbf{0}}\) \(\newcommand{\T}{\text{T}}\) \(\newcommand{\Trace}{\text{Trace}}\) \(\newcommand{\Cov}{\text{Cov}}\) \(\newcommand{\Var}{\text{Var}}\) \(\newcommand{\E}{\mathbb{E}}\) \(\newcommand{\Pr}{\text{Pr}}\) \(\newcommand{\pr}{\text{pr}}\) \(\newcommand{\pdf}{\text{pdf}}\) \(\newcommand{\P}{\text{P}}\) \(\newcommand{\p}{\text{p}}\) \(\newcommand{\One}{\mathbf{1}}\) \(\newcommand{\argmin}{\operatorname*{arg\,min}}\) 
\(\newcommand{\argmax}{\operatorname*{arg\,max}}\) \(\newcommand{\dtheta}{\frac{\partial}{\partial\theta} }\) \(\newcommand{\ptheta}{\nabla_\theta}\) \(\newcommand{\alert}[1]{\color{darkorange}{#1}}\) \(\newcommand{\alertr}[1]{\color{red}{#1}}\) \(\newcommand{\alertb}[1]{\color{blue}{#1}}\)

1 Overview

Our previous lecture introduced the Markov Decision Process (MDP) and the Bellman Equation. There, we evaluated a given policy by solving the linear system (Sutton and Barto 2018)

\[ \bv^\pi = \br^\pi + \gamma \bP^\pi \bv^\pi, \tag{1} \]

where \(\bv^\pi\) is the vector of value-function evaluations over all states, \(\br^\pi\) is the vector of expected rewards, \(\bP^\pi\) is the transition matrix under policy \(\pi\), and \(\gamma \in [0, 1)\) is the discount factor. Although this equation has a closed-form solution, computing or estimating all components can be difficult in practice, especially when the state or action space is large.

In this lecture, we instead view the Bellman Equation as a fixed-point equation for an operator acting on value functions. We will introduce the Bellman operator, verify that it is a contraction under the supremum norm, and show that repeated application of this operator converges to the true value function. This provides the foundation for iterative policy evaluation.

2 Bellman Operator

For clarity, throughout this lecture we focus on a deterministic policy. That is, the action at state \(s\) is given by \(a = \pi(s)\) with no randomness. All arguments extend to the stochastic case, but the deterministic setting avoids an additional integration over actions. Under a deterministic policy, the value function satisfies

\[ \begin{aligned} V^\pi(s) &= \E_\pi \left[ R + \gamma\, V^\pi(S') \mid S = s \right] \\ &= r(s, \pi(s)) + \gamma \E_{S' \sim \P(\cdot \mid s, \pi(s))} \left[ V^\pi(S') \right], \end{aligned} \]

where \(r(s,a)\) is the expected immediate reward and \(\P(s' \mid s,a)\) is the transition density or probability mass function. This motivates defining the Bellman operator \(\cT^\pi\), which maps any candidate value function \(V\) (a finite- or infinite-dimensional vector) to a new function:

\[ \cT^\pi V(s) = r(s, \pi(s)) + \gamma \E_{S' \sim \P(\cdot \mid s, \pi(s))} \left[ V(S') \right]. \tag{2} \]

The true value function \(V^\pi\) is the unique fixed point of this operator, meaning that

\[ \cT^\pi V^\pi = V^\pi. \]

We will show in the following that \(\cT^\pi\) is a contraction mapping on the space of bounded functions under the supremum norm. This contraction property implies not only that the fixed point exists and is unique, but also that iteratively applying the operator,

\[ V_\text{new} \;\leftarrow\; \cT^\pi V_\text{old}, \]

converges to \(V^\pi\) regardless of the starting function. This provides a practical method for policy evaluation when \(\bP^\pi\) or \(\br^\pi\) cannot be computed directly.
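
To make the operator concrete, here is a minimal sketch of one application in the finite-state case, where (2) reduces to the matrix form \(\cT^\pi V = \br^\pi + \gamma \bP^\pi V\). The function name bellman_pi and the toy two-state quantities below are illustrative and not part of the example developed later.

  # A sketch of one application of the Bellman operator in the finite-state
  # case, where T^pi V = r^pi + gamma * P^pi V. The function name and the
  # toy two-state quantities are illustrative.
  bellman_pi <- function(V, r_pi, P_pi, gamma) {
    as.vector(r_pi + gamma * P_pi %*% V)
  }

  # a toy two-state example
  r_pi  = c(1, 0)
  P_pi  = matrix(c(0.9, 0.1,
                   0.2, 0.8), byrow = TRUE, nrow = 2)
  gamma = 0.9

  V = c(0, 0)                               # any starting value function
  V = bellman_pi(V, r_pi, P_pi, gamma)      # one application of T^pi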

3 Contraction Mapping

Before showing that the Bellman operator is a contraction, we review a classical result in functional analysis, the Banach Fixed-point Theorem.

(Banach Fixed-point Theorem) Let \(\cX\) be a complete metric space with distance metric \(d\), and let \(\cT : \cX \to \cX\) be a contraction mapping. That is, there exists a constant \(0 \le \gamma < 1\) such that \[ d(\cT(x_1), \cT(x_2)) \le \gamma \, d(x_1, x_2) \quad \text{for all } x_1, x_2 \in \cX. \] Then \(\cT\) has a unique fixed point \(x^\ast\), and the sequence \((\cT^n x_0)\) converges to \(x^\ast\) for any starting point \(x_0\).
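
As a quick numerical illustration of the theorem, independent of MDPs, consider the map \(\cT(x) = 0.5\,x + 1\) on the real line with the usual distance. It is a contraction with constant \(0.5\) and unique fixed point \(x^\ast = 2\); the short sketch below iterates it from an arbitrary starting point.

  # Banach Fixed-point Theorem in action: T(x) = 0.5 * x + 1 is a contraction
  # with constant 0.5, so iterating it converges to the fixed point x* = 2.
  T_map <- function(x) 0.5 * x + 1

  x = 100                        # arbitrary starting point
  for (n in 1:30) x = T_map(x)   # repeated application of T
  x                              # approximately 2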

We now verify that the Bellman operator \(\cT^\pi\) is a contraction mapping on the space of bounded value functions equipped with the supremum norm. Recalling the definition in (2), consider any two bounded value functions \(V_1\) and \(V_2\). Then

\[ \begin{aligned} \| \cT^\pi V_1 - \cT^\pi V_2 \|_\infty &= \sup_s \left| \gamma \E_{S' \sim \P(\cdot\mid s,\pi(s))} \left[ V_1(S') - V_2(S') \right] \right| \\ &\le \gamma \sup_s \E_{S'} \left[ \left| V_1(S') - V_2(S') \right| \right] \\ &\le \gamma \sup_{s'} \left| V_1(s') - V_2(s') \right| \\ &= \gamma \, \|V_1 - V_2\|_\infty. \end{aligned} \]

The second inequality uses the elementary fact that the expectation of a bounded function is bounded above by its supremum. This argument applies equally in the discrete case (using sums) and the continuous case (using integrals). Thus, \(\cT^\pi\) is a \(\gamma\)-contraction, and Banach’s theorem guarantees the existence and uniqueness of the fixed point \(V^\pi\), as well as convergence of the iteration

\[ V_\text{new} \leftarrow \cT^\pi V_\text{old} \]

to \(V^\pi\) for any initial function \(V\). This result applies to both finite and infinite state spaces. When the state space is finite, the fixed point can also be obtained by solving the linear system in (1). However, when the space is large or continuous, it is often infeasible to compute \(\bP^\pi\) directly. Hence, we may want to empirically estimate the Bellman operator using samples, and then iteratively apply it to approximate \(V^\pi\). We can also check directly that repeated application of the Bellman operator reaches this limit. In the finite-state case the operator takes the matrix form \(\cT^\pi V = \br^\pi + \gamma \bP^\pi V\), so starting from any \(V\) and applying the operator twice, we get

\[ \begin{aligned} (\cT^\pi)^2 V &= \br^\pi + \gamma \bP^\pi (\br^\pi + \gamma \bP^\pi V) \\ &= \br^\pi + \gamma \bP^\pi \br^\pi + \gamma^2 (\bP^\pi)^2 V. \end{aligned} \]

A third application yields

\[ (\cT^\pi)^3 V = \br^\pi + \gamma \bP^\pi \br^\pi + \gamma^2 (\bP^\pi)^2 \br^\pi + \gamma^3 (\bP^\pi)^3 V. \]

We can now see a clear pattern: each application introduces an additional term \(\gamma^k (\bP^\pi)^k \br^\pi\), and the influence of the initial \(V\) is pushed forward by another factor of \(\gamma \bP^\pi\). A simple induction argument verifies that for any integer \(n \ge 1\),

\[ (\cT^\pi)^n V = \sum_{k=0}^{n-1} \gamma^k (\bP^\pi)^k \br^\pi + \gamma^n (\bP^\pi)^n V. \]

The last term represents the residual contribution of the initial function \(V\) after \(n\) iterations. Because \(\gamma \in [0,1)\) and \(\bP^\pi\) is a stochastic matrix1, this remainder term vanishes as \(n \to \infty\). Hence the sequence \((\cT^\pi)^n V\) converges to the infinite discounted sum

\[ \sum_{k=0}^{\infty} \gamma^k (\bP^\pi)^k \br^\pi. \]

Noticing that

\[ (\bI - \gamma \bP^\pi) \sum_{k=0}^{\infty} \gamma^k (\bP^\pi)^k = \bI, \]

we conclude that

\[ \lim_{n \to \infty} (\cT^\pi)^n V = (\bI - \gamma \bP^\pi)^{-1} \br^\pi = \bv^\pi. \]

which is exactly the solution we obtained from the linear system in (1). This convergence can also be seen directly from the contraction property, by applying it repeatedly:

\[ \| (\cT^\pi)^n V_1 - (\cT^\pi)^n V_2 \|_\infty \le \gamma^n \|V_1 - V_2\|_\infty \longrightarrow 0, \quad n \to \infty. \]
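
The decay rate \(\gamma^n\) can also be checked numerically. The sketch below draws a small random stochastic matrix and reward vector (all names and numbers here are illustrative and unrelated to the example in the next section), applies the operator \(n\) times to two different starting functions, and compares their distance with the bound \(\gamma^n \|V_1 - V_2\|_\infty\).

  # Numerical check of the gamma^n contraction rate on a random small MDP.
  set.seed(42)
  P_pi  = matrix(runif(9), nrow = 3)
  P_pi  = P_pi / rowSums(P_pi)              # normalize rows into a stochastic matrix
  r_pi  = runif(3)
  gamma = 0.7

  T_pi <- function(V) as.vector(r_pi + gamma * P_pi %*% V)

  V1 = c(10, -5, 3)
  V2 = c(0, 0, 0)
  n  = 10

  for (k in 1:n) { V1 = T_pi(V1); V2 = T_pi(V2) }

  max(abs(V1 - V2))               # distance after n applications
  gamma^n * 10                    # contraction bound gamma^n * ||V1 - V2||_inf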

3.1 Stochastic Policies

The above discussion can be easily extended to stochastic policies. In this case, the Bellman operator becomes a linear combination over all possible actions, weighted by the policy probabilities:

\[ \cT^\pi V(s) = \sum_a \pi(a \mid s) \left[ r(s,a) + \gamma \E_{S' \sim \P(\cdot \mid s,a)} \left[ V(S') \right] \right]. \]

The contraction property can be verified in the same way: the expectation over actions is simply a weighted (convex) average, which does not increase the supremum-norm bound. Thus, all results for deterministic policies carry over to stochastic policies without change.
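
As a sketch of this weighted average in code, assume a finite MDP with two actions where P1 and P2 are the per-action transition matrices, r is the state-by-action reward matrix, and Pi stores the policy probabilities \(\pi(a \mid s)\) (the same layout used in the example later in these notes). One application of the stochastic-policy Bellman operator could then look like the following.

  # Bellman operator under a stochastic policy: a policy-weighted average of
  # the two per-action updates. P1, P2, r, and Pi follow the layout described
  # above and are assumed to be defined elsewhere.
  bellman_stochastic <- function(V, Pi, P1, P2, r, gamma) {
    Q1 = r[, 1] + gamma * as.vector(P1 %*% V)   # update if action 1 were taken
    Q2 = r[, 2] + gamma * as.vector(P2 %*% V)   # update if action 2 were taken
    Pi[, 1] * Q1 + Pi[, 2] * Q2                 # average weighted by pi(a | s)
  }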

3.2 Continuous State and Action Spaces

When the state and action spaces are continuous, the Bellman operator is defined using integrals instead of sums. For example, if we assume a density function \(p(s' \mid s,a)\) for the transition kernel and a density \(\pi(a \mid s)\) for the stochastic policy, then

\[ \cT^\pi V(s) = \int_a \pi(a \mid s) \left[ r(s,a) + \gamma \int_{s'} p(s' \mid s,a) V(s') ds' \right] da. \]

The contraction property still holds because the integrals are linear operators and the same supremum norm bounds apply.
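
In practice, the integral over \(s'\) usually has to be approximated. The following sketch discretizes a one-dimensional state space on a grid and replaces the integral with a Riemann sum; the reward function, the Gaussian transition density, and the assumption that the policy has been folded into both are illustrative choices, not part of the running example.

  # A sketch of the Bellman operator on a continuous state space [0, 1],
  # approximating the integral over s' with a Riemann sum on a grid.
  # The reward r_s() and transition density p_sprime() are illustrative,
  # with the (deterministic) policy assumed folded into both.
  grid  = seq(0, 1, length.out = 201)
  h     = grid[2] - grid[1]                    # grid spacing
  gamma = 0.7

  r_s      = function(s) 1 - (s - 0.5)^2
  p_sprime = function(sp, s) dnorm(sp, mean = 0.8 * s, sd = 0.1)

  bellman_cont <- function(V) {
    sapply(grid, function(s) {
      dens = p_sprime(grid, s)
      dens = dens / sum(dens * h)              # renormalize on the truncated grid
      r_s(s) + gamma * sum(dens * V * h)       # Riemann-sum approximation
    })
  }

  V = rep(0, length(grid))
  for (k in 1:100) V = bellman_cont(V)         # iterate toward the fixed point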

4 Example: Iterative Policy Evaluation

We now work through a simple example to illustrate iterative policy evaluation using the Bellman operator. Consider a simple MDP with 3 states and 2 actions. The transition probability matrices for action 1 and action 2 are given by

\[ P_1 = \begin{pmatrix} 0.8 & 0.1 & 0.1 \\ 0.05 & 0.05 & 0.9 \\ 0.2 & 0.2 & 0.6 \end{pmatrix} \]

\[ P_2 = \begin{pmatrix} 0.5 & 0.25 & 0.25 \\ 0.1 & 0.8 & 0.1 \\ 0.8 & 0.1 & 0.1 \end{pmatrix} \]

Our immediate reward function is

\[ r = \begin{pmatrix} 5 & 3 \\ 2 & 2.5 \\ 3 & 2 \end{pmatrix} \]

The following code performs iterative policy evaluation for a given stochastic policy:


  # Transition matrices and reward matrix as before
  P1 = matrix(c(0.8, 0.1, 0.1,
                0.05, 0.05, 0.9,
                0.2, 0.2, 0.6), byrow=TRUE, nrow=3)

  P2 = matrix(c(0.5, 0.25, 0.25,
                0.1, 0.8, 0.1,
                0.8, 0.1, 0.1), byrow=TRUE, nrow=3)

  # define the expected reward for each state and each action
  r = matrix(c(5, 3,
               2, 2.5,
               3, 2), byrow=TRUE, nrow=3)

  gamma = 0.7

  # Define a stochastic policy: each row is a state, 
  # Columns represent the probability of taking action 1 and action 2 respectively
  Pi = matrix(c(0.8, 0.2,
                0.3, 0.7,
                0.7, 0.3), byrow = TRUE, nrow = 3)

  # Induced reward vector r^pi(s) = sum_a pi(a|s) r(s,a)
  r_pi = rowSums(r * Pi)

  # Induced transition matrix P^pi(s, s') = sum_a pi(a|s) P_a(s, s')
  P_pi = matrix(0, nrow = 3, ncol = 3)
  for (s in 1:3) {
    P_pi[s, ] = Pi[s, 1] * P1[s, ] + Pi[s, 2] * P2[s, ]
  }

  # Iterative policy evaluation: V_{k+1} = r^pi + gamma * P^pi V_k
  V = c(0, 0, 0)   # initial value function
  niter = 100

  for (k in 1:niter)
  {
    V_new = as.vector(r_pi + gamma * P_pi %*% V)

    if (k <= 6 | k == niter)
    {
      if (k == niter) cat("\n--------------\n\n")
      cat(paste("Iteration", k, "\n"))
      print(V_new)
    }

    V = V_new
  }
## Iteration 1 
## [1] 4.60 2.35 2.70
## Iteration 2 
## [1] 7.442350 4.212175 5.053750
## Iteration 3 
## [1] 9.298336 5.691013 6.772845
## Iteration 4 
## [1] 10.550749  6.805821  7.984034
## Iteration 5 
## [1] 11.411165  7.617313  8.831363
## Iteration 6 
## [1] 12.007813  8.196797  9.423709
## 
## --------------
## 
## Iteration 100 
## [1] 13.390040  9.569872 10.803745

  # Exact solution (linear system)
  cat("\nExact solution:\n")
## 
## Exact solution:
  V_exact = solve(diag(3) - gamma * P_pi, r_pi)
  print(V_exact)
## [1] 13.390040  9.569872 10.803745

5 Bellman Optimality Equation and Value Iteration

So far we have focused on evaluating a given policy (policy evaluation). The next step in the MDP framework is to identify a policy that maximizes long-term return (policy learning). Recall from our earlier discussion of Q-learning that rolling out the optimal policy \(\pi^\ast\) yields the highest possible value at every state:

\[ V^\ast(s) = \sup_{\pi} \, V^\pi(s) = \E\!\left[ R + \gamma V^\ast(S') \,\big|\, S = s, A = \pi^\ast(s) \right]. \]

Since the optimal policy selects the action that maximizes the continuation value, this leads naturally to the Bellman Optimality Equation:

\[ V^\ast(s) = \max_{a} \left\{ r(s,a) + \gamma \E_{S' \sim p(\cdot \mid s,a)} \left[ V^\ast(S') \right] \right\}. \]

This equation defines the Bellman Optimality Operator, denoted \(\cT^\ast\), which acts on any candidate value function \(V\) as

\[ \cT^\ast V(s) = \max_{a} \left\{ r(s,a) + \gamma \E_{S' \sim p(\cdot \mid s,a)} \left[ V(S') \right] \right\}. \]

The true optimal value function \(V^\ast\) is the unique fixed point of this operator: \[ \cT^\ast V^\ast = V^\ast. \]

Naturally, the optimal policy can be extracted as the greedy policy with respect to \(V^\ast\):

\[ \pi^\ast(s) = \arg\max_a \left\{ r(s,a) + \gamma \E_{S' \sim p(\cdot \mid s,a)} \left[ V^\ast(S') \right] \right\}. \]

We can also check that, although \(\cT^\ast\) involves a maximization, it retains the same contraction behavior as the policy-specific Bellman operator. For any two bounded value functions \(V_1\) and \(V_2\),

\[ \begin{aligned} \big\| \cT^\ast V_1 - \cT^\ast V_2 \big\|_\infty &= \sup_s \left| \max_a \left\{ r(s,a) + \gamma \E[V_1(S') \mid s,a] \right\} - \max_a \left\{ r(s,a) + \gamma \E[V_2(S') \mid s,a] \right\} \right| \\ &\le \sup_s \max_a \left| \gamma \E[V_1(S') - V_2(S') \mid s,a] \right| \\ &= \gamma \sup_s \max_a \int_{s'} | V_1(s') - V_2(s') | \,\, p(s' \mid s,a) ds' \\ &\le \gamma \sup_s \max_a \int_{s'} \| V_1 - V_2 \|_\infty p(s' \mid s,a) ds' \\ &= \gamma \| V_1 - V_2 \|_\infty. \end{aligned} \]

where the first inequality uses the fact that the difference of two maxima is bounded by the maximum of their differences2, and the subsequent steps use standard supremum-norm bounds. Thus \(\cT^\ast\) is a \(\gamma\)-contraction under the supremum norm. By Banach’s Fixed-point Theorem, the fixed point \(V^\ast\) exists, is unique, and can be obtained by repeatedly applying \(\cT^\ast\):

\[ V_\text{new} = \cT^\ast V_\text{old}. \]

Regardless of the initial guess \(V_0\), the sequence \((V_k)\) converges to the unique fixed point \(V^\ast\), and the optimal policy is extracted as the greedy policy with respect to \(V^\ast\), as defined previously. This iterative procedure for finding the optimal value function is known as Value Iteration.

6 Example: Value Iteration

In the following, we illustrate how to apply the Bellman Optimality operator to the example MDP developed earlier. Since the transition probabilities and expected rewards are assumed known, this setting allows us to perform exact value iteration and observe the emergence of the optimal policy. We use the same transition matrices and reward structure as before. The example was designed such that

  • State 1 gives the highest reward, and picking action 1 will most likely keep the process in state 1.
  • State 2 has the lowest reward. One may pick action 2 to get a higher immediate reward, but the process will then most likely remain in state 2. Alternatively, one can pick action 1 to transition into state 3, which can eventually lead to state 1 and its higher reward.
  • State 3 also has a lower reward than state 1. Action 1 gives a slightly higher immediate reward, but choosing action 2 will most likely move the process into state 1, which yields a larger reward in the long run.
  • Overall, in both state 2 and state 3, we would prefer an action that gives a lower immediate reward but moves us into state 1 faster.
  # Define the transition probability matrix and rewards 

  P1 = matrix(c(0.8, 0.1, 0.1,
                0.05, 0.05, 0.9,
                0.2, 0.2, 0.6), byrow=TRUE, nrow=3)

  P2 = matrix(c(0.5, 0.25, 0.25,
                0.1, 0.8, 0.1,
                0.8, 0.1, 0.1), byrow=TRUE, nrow=3)

  # define the expected reward for each state and each action
  r = matrix(c(5, 3,
               2, 2.5,
               3, 2), byrow=TRUE, nrow=3)
  
  gamma = 0.7
  
  # start with some arbitrary value function
  V = c(0, 0, 0)

  # recursively update the value function using the Bellman Optimality operator,
  # recording the greedy policy at each iteration
  niter = 20
  for (k in 1:niter)
  {
    # Calculate the Q function, for each state and each action
    Q = matrix(NA, nrow=3, ncol=2)
    
    Q[, 1] = r[, 1] + gamma * P1 %*% V
    Q[, 2] = r[, 2] + gamma * P2 %*% V
    
    # update the value function using the bellman optimality operator
    V = apply(Q, 1, max)
    
    SummaryMat = cbind("Policy" = apply(Q, 1, which.max),
                       "Value" = V)
    
    if (k <= 4 | k == niter)
    {
      if (k == niter) cat("\n--------------\n\n")
      cat(paste("Iteration", k, "\n"))
      print(SummaryMat)
    }
  }
## Iteration 1 
##      Policy Value
## [1,]      1   5.0
## [2,]      2   2.5
## [3,]      1   3.0
## Iteration 2 
##      Policy Value
## [1,]      1 8.185
## [2,]      2 4.460
## [3,]      1 5.310
## Iteration 3 
##      Policy    Value
## [1,]      1 10.26750
## [2,]      2  5.94225
## [3,]      2  7.26750
## Iteration 4 
##      Policy     Value
## [1,]      1 11.674482
## [2,]      1  7.145866
## [3,]      2  8.674482
## 
## --------------
## 
## Iteration 20 
##      Policy    Value
## [1,]      1 14.90083
## [2,]      1 10.37910
## [3,]      2 11.90083

This is clearly better than the previously defined stochastic policy in terms of the value function. In fact, after just four iterations the greedy policy has already settled on the optimal policy, although the value function itself takes longer to converge. We can validate the policy value using Monte Carlo simulation:

  # write a function to simulate the MDP
  simulate_MDP <- function(n, policy, P1, P2, r, gamma, s0){
    # prepare the data
    S = rep(NA, n)
    A = rep(NA, n)
    R = rep(NA, n)
    
    # lets start from the first state
    S[1] = s0
    
    # sequential simulation
    for (t in 1:n){
      
      # take an action based on the policy A_t
      A[t] = sample(1:2, 1, prob = policy[S[t],])
      
      # get the reward r(S_t, A_t) with a bit of noise
      R[t] = r[S[t], A[t]] + rnorm(1)*0.1
      
      if (t < n){
        # get the next state depending on the action
        if (A[t] == 1){
          S[t+1] = sample(1:3, 1, prob = P1[S[t],])
        } else {
          S[t+1] = sample(1:3, 1, prob = P2[S[t],])
        }
      }
    }
  
    #record the cumulative discounted reward
    return(list(S = S, A = A, R = R, cum_reward = sum(gamma^seq(0, n-1) %*% R)))
  }

  # Simulate the data
  # generate 1000 simulated trajectories from each starting state
  nsim = 1000
  cum_reward = matrix(NA, nsim, 3)
  set.seed(1)
  
  # the policy found by value iteration: action 1 in states 1 and 2, action 2 in state 3
  mypolicy = matrix(c(1, 0,
                      1, 0,
                      0, 1), byrow=TRUE, nrow=3)
  
  nchain = 100
  
  for (k in 1:nsim)
  {
    #record the cumulative discounted reward
    cum_reward[k, 1] = simulate_MDP(nchain, mypolicy, P1, P2, r, gamma, 1)$cum_reward
    cum_reward[k, 2] = simulate_MDP(nchain, mypolicy, P1, P2, r, gamma, 2)$cum_reward
    cum_reward[k, 3] = simulate_MDP(nchain, mypolicy, P1, P2, r, gamma, 3)$cum_reward
  }
  
  # simulated mean value function
  colMeans(cum_reward)
## [1] 14.79509 10.41573 11.87744
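
As a further check, the value of this deterministic policy can be computed exactly by solving the linear system in (1) with its induced reward vector and transition matrix. The sketch below does so for the policy found by value iteration (action 1 in states 1 and 2, action 2 in state 3); the result should be close to both the value-iteration output and the Monte Carlo averages above.

  # Exact value of the value-iteration policy via the linear system in (1)
  r_star = c(r[1, 1], r[2, 1], r[3, 2])        # rewards under this policy
  P_star = rbind(P1[1, ], P1[2, ], P2[3, ])    # transitions under this policy
  solve(diag(3) - gamma * P_star, r_star)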

7 Bellman Optimality Operator for Q-functions

Similar to the value-function version, we can also derive the Bellman operator and the Bellman optimality operator for the Q-function. Recalling the Bellman equation for the Q-function, we can correspondingly define the Bellman operator under a fixed policy \(\pi\) as:

\[ \cT_Q^\pi Q(s,a) = r(s,a) + \gamma \E_{S' \sim p(\cdot \mid s,a)} \left[ \sum_{a'} \pi(a' \mid S') Q(S', a') \right]. \]

Under the supremum norm defined as \(\| Q \|_\infty = \sup_{s,a} |Q(s,a)|\), it is easy to see that this operator is also a contraction mapping:

\[ \begin{aligned} &\big| (\cT_Q^\pi Q_1)(s,a) - (\cT_Q^\pi Q_2)(s,a) \big| \\ =& \, \gamma \, \left| \E\!\left[ Q_1(S',A') - Q_2(S',A') \,\big|\, S=s, A=a \right] \right| \\ \le& \, \gamma \, \E\!\left[ | Q_1(S',A') - Q_2(S',A') | \,\big|\, S=s, A=a \right] \\ \le& \, \gamma \, \E\!\left[ \| Q_1 - Q_2 \|_\infty \,\big|\, S=s, A=a \right] \\ =& \, \gamma \, \| Q_1 - Q_2 \|_\infty. \end{aligned} \]

Taking the supremum over \((s,a)\), we have

\[ \| \cT_Q^\pi Q_1 - \cT_Q^\pi Q_2 \|_\infty \le \gamma \| Q_1 - Q_2 \|_\infty. \]
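
This again justifies an iterative scheme: starting from any Q-function, repeatedly applying \(\cT_Q^\pi\) converges to \(Q^\pi\). As a sketch, the code below evaluates the stochastic policy Pi from the earlier policy evaluation example in this way; the weighted average rowSums(Pi * Q) recovers \(V^\pi(s) = \sum_a \pi(a \mid s)\, Q^\pi(s,a)\), which should agree with the values computed there.

  # Iterative Q-function policy evaluation for the example MDP and the
  # stochastic policy Pi defined earlier: Q_{k+1} = T_Q^pi Q_k
  Q = matrix(0, nrow = 3, ncol = 2)            # initial Q-function

  for (k in 1:100) {
    # expected continuation value at the next state: sum_a' pi(a'|s') Q(s', a')
    V_pi = rowSums(Pi * Q)

    Q = cbind(r[, 1] + gamma * as.vector(P1 %*% V_pi),
              r[, 2] + gamma * as.vector(P2 %*% V_pi))
  }

  Q                      # approximate Q^pi(s, a)
  rowSums(Pi * Q)        # V^pi(s), should match the earlier policy evaluation result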

Analogous to the value-function case, define the optimal Bellman operator on Q-functions as

\[ \cT_Q^\ast Q(s,a) = r(s,a) + \gamma \E_{S' \sim p(\cdot \mid s,a)} \big[ \max_{a'} Q(S', a') \big], \]

with the optimal policy extracted as the greedy policy with respect to \(Q^\ast\):

\[ \pi^\ast(s) \in \arg\max_a Q^\ast(s,a). \]

The contraction property can again be validated directly:

\[ \begin{aligned} &\big| (\cT_Q^\ast Q_1)(s,a) - (\cT_Q^\ast Q_2)(s,a) \big| \\ =& \, \gamma \, \left| \E\!\left[ \max_{a'} Q_1(S',a') - \max_{a'} Q_2(S',a') \,\big|\, S=s, A=a \right] \right| \\ \le& \, \gamma \, \E\!\left[ \big| \max_{a'} Q_1(S',a') - \max_{a'} Q_2(S',a') \big| \,\big|\, S=s, A=a \right] \\ \le& \, \gamma \, \E\!\left[ \max_{a'} | Q_1(S',a') - Q_2(S',a') | \,\big|\, S=s, A=a \right] \\ \le& \, \gamma \, \sup_{s',a'} | Q_1(s',a') - Q_2(s',a') | \\ =& \, \gamma \, \| Q_1 - Q_2 \|_\infty. \end{aligned} \]

Therefore

\[ \| \cT_Q^\ast Q_1 - \cT_Q^\ast Q_2 \|_\infty \le \gamma \| Q_1 - Q_2 \|_\infty, \]

so \(\cT_Q^\ast\) is also a \(\gamma\)-contraction under the supremum norm, and repeatedly applying it from any initial Q-function converges to the unique fixed point \(Q^\ast\).
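
The sketch below runs this Q-function value iteration on the example MDP; the greedy policy and the state values extracted from the final Q-function should agree with the value iteration results obtained earlier.

  # Q-function value iteration for the example MDP: Q_{k+1} = T_Q^* Q_k
  Q = matrix(0, nrow = 3, ncol = 2)            # initial Q-function

  for (k in 1:100) {
    V_max = apply(Q, 1, max)                   # max_a' Q(s', a')
    Q = cbind(r[, 1] + gamma * as.vector(P1 %*% V_max),
              r[, 2] + gamma * as.vector(P2 %*% V_max))
  }

  apply(Q, 1, which.max)     # greedy policy, should match the value iteration example
  apply(Q, 1, max)           # optimal values, should match the value iteration example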


Sutton, Richard S, and Andrew G Barto. 2018. Reinforcement Learning: An Introduction. MIT press.

  1. The infinity norm of a matrix \(A\) is defined as \(\|A\|_\infty = \max_i \sum_j |a_{ij}|\), which equals 1 for a stochastic matrix. Hence, multiplying by the matrix repeatedly never increases the supremum norm of a vector, and the factor \(\gamma^n\) with \(\gamma < 1\) drives the remainder term to zero.↩︎

  2. This inequality follows from the fact that for any two sets of real numbers \(\{x_a\}\) and \(\{y_a\}\), \(|\max_a x_a - \max_a y_a| \le \max_a |x_a - y_a|\).↩︎