Our previous lecture introduced the Markov Decision Process (MDP) and the Bellman Equation. There, we evaluated a given policy by solving the linear system (Sutton and Barto 2018)
\[ \bv^\pi = \br^\pi + \gamma \bP^\pi \bv^\pi, \tag{1} \]
where \(\bv^\pi\) is the vector of value-function evaluations over all states, \(\br^\pi\) is the vector of expected rewards, \(\bP^\pi\) is the transition matrix under policy \(\pi\), and \(\gamma \in [0, 1)\) is the discount factor. Although this equation has a closed-form solution, computing or estimating all components can be difficult in practice, especially when the state or action space is large.
In this lecture, we instead view the Bellman Equation as a fixed-point equation for an operator acting on value functions. We will introduce the Bellman operator, verify that it is a contraction under the supremum norm, and show that repeated application of this operator converges to the true value function. This provides the foundation for iterative policy evaluation.
For clarity, throughout this lecture we focus on a deterministic policy. That is, the action at state \(s\) is given by \(a = \pi(s)\) with no randomness. All arguments extend to the stochastic case, but the deterministic setting avoids an additional expectation over actions. Under a deterministic policy, the value function satisfies
\[ \begin{aligned} V^\pi(s) &= \E_\pi \left[ R + \gamma\, V^\pi(S') \mid S = s \right] \\ &= r(s, \pi(s)) + \gamma \E_{S' \sim \P(\cdot \mid s, \pi(s))} \left[ V^\pi(S') \right], \end{aligned} \]
where \(r(s,a)\) is the expected immediate reward and \(\P(s' \mid s,a)\) is the transition density or probability mass function. This motivates defining the Bellman operator \(\cT^\pi\), which maps any candidate value function \(V\) (a finite- or infinite-dimensional vector) to a new function:
\[ \cT^\pi V(s) = r(s, \pi(s)) + \gamma \E_{S' \sim \P(\cdot \mid s, \pi(s))} \left[ V(S') \right]. \tag{2} \]
The true value function \(V^\pi\) is the unique fixed point of this operator, meaning that
\[ \cT^\pi V^\pi = V^\pi. \]
We will show in the following that \(\cT^\pi\) is a contraction mapping on the space of bounded functions under the supremum norm. This contraction property implies not only that the fixed point exists and is unique, but also that iteratively applying the operator,
\[ V_\text{new} \;\leftarrow\; \cT^\pi V_\text{old}, \]
converges to \(V^\pi\) regardless of the starting function. This provides a practical method for policy evaluation when \(\bP^\pi\) or \(\br^\pi\) cannot be computed directly.
Before showing that the Bellman operator is a contraction, we review a classical result in functional analysis, the Banach Fixed-point Theorem.
(Banach Fixed-point Theorem) Let \(\cX\) be a complete metric space with distance metric \(d\), and let \(\cT : \cX \to \cX\) be a contraction mapping. That is, there exists a constant \(0 \le \gamma < 1\) such that \[ d(\cT(x_1), \cT(x_2)) \le \gamma \, d(x_1, x_2) \quad \text{for all } x_1, x_2 \in \cX. \] Then \(\cT\) has a unique fixed point \(x^\ast\), and the sequence \((\cT^n x_0)\) converges to \(x^\ast\) for any starting point \(x_0\).
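As a toy illustration of the theorem, the short sketch below (the map \(T(x) = 0.5\,x + 1\) is a made-up example, unrelated to the MDP above) iterates a contraction on the real line with contraction factor \(0.5\); the iterates approach the fixed point \(x^\ast = 2\), and the distance to it halves at every step.
# A toy contraction on the real line: T(x) = 0.5 * x + 1
# |T(x1) - T(x2)| = 0.5 * |x1 - x2|, so the contraction factor is 0.5 and the fixed point is x* = 2
T_toy = function(x) 0.5 * x + 1
x = 10   # arbitrary starting point
for (k in 1:8) {
  x = T_toy(x)
  cat(sprintf("Iteration %d: x = %.6f, distance to fixed point = %.6f\n",
              k, x, abs(x - 2)))
}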
We now verify that the Bellman operator \(\cT^\pi\) is a contraction mapping on the space of bounded value functions equipped with the supremum norm. Recalling the definition in (2), consider any two bounded value functions \(V_1\) and \(V_2\). Then
\[ \begin{aligned} \| \cT^\pi V_1 - \cT^\pi V_2 \|_\infty &= \sup_s \left| \gamma \E_{S' \sim \P(\cdot\mid s,\pi(s))} \left[ V_1(S') - V_2(S') \right] \right| \\ &\le \gamma \sup_s \E_{S'} \left[ \left| V_1(S') - V_2(S') \right| \right] \\ &\le \gamma \sup_{s'} \left| V_1(s') - V_2(s') \right| \\ &= \gamma \, \|V_1 - V_2\|_\infty. \end{aligned} \]
The second inequality uses the elementary fact that the expectation of a bounded function is bounded above by its supremum. This argument applies equally in the discrete case (using sums) and the continuous case (using integrals). Thus, \(\cT^\pi\) is a \(\gamma\)-contraction, and Banach’s theorem guarantees the existence and uniqueness of the fixed point \(V^\pi\), as well as convergence of the iteration
\[ V_\text{new} \leftarrow \cT^\pi V_\text{old} \]
to \(V^\pi\) for any initial function \(V\). This result applies to both finite and infinite state spaces. When the state space is finite, the fixed point can also be obtained by solving the linear system in (1). However, when the space is large or continuous, it is often infeasible to compute \(\bP^\pi\) directly. Hence, we may instead estimate the Bellman operator empirically from samples and apply it iteratively to approximate \(V^\pi\). We can also verify directly that repeated application of the Bellman operator reaches this limit. In the finite-state matrix notation, \(\cT^\pi V = \br^\pi + \gamma \bP^\pi V\), so starting from any \(V\) and applying the operator twice, we get
\[ \begin{aligned} (\cT^\pi)^2 V &= \br^\pi + \gamma \bP^\pi (\br^\pi + \gamma \bP^\pi V) \\ &= \br^\pi + \gamma \bP^\pi \br^\pi + \gamma^2 (\bP^\pi)^2 V. \end{aligned} \]
A third application yields
\[ (\cT^\pi)^3 V = \br^\pi + \gamma \bP^\pi \br^\pi + \gamma^2 (\bP^\pi)^2 \br^\pi + \gamma^3 (\bP^\pi)^3 V. \]
We can now see a clear pattern: each application introduces an additional term \(\gamma^k (\bP^\pi)^k \br^\pi\), and the influence of the initial \(V\) is pushed forward by another factor of \(\gamma \bP^\pi\). A simple induction argument verifies that for any integer \(n \ge 1\),
\[ (\cT^\pi)^n V = \sum_{k=0}^{n-1} \gamma^k (\bP^\pi)^k \br^\pi + \gamma^n (\bP^\pi)^n V. \]
The last term represents the residual contribution of the initial function \(V\) after \(n\) iterations. Because \(\gamma \in [0,1)\) and \(\bP^\pi\) is a stochastic matrix¹, this remainder term vanishes as \(n \to \infty\). Hence the sequence \((\cT^\pi)^n V\) converges to the infinite discounted sum
\[ \sum_{k=0}^{\infty} \gamma^k (\bP^\pi)^k \br^\pi. \]
Noticing that
\[ (\bI - \gamma \bP^\pi) \sum_{k=0}^{\infty} \gamma^k (\bP^\pi)^k = \bI, \]
we conclude that
\[ \lim_{n \to \infty} (\cT^\pi)^n V = (\bI - \gamma \bP^\pi)^{-1} \br^\pi = \bv^\pi. \]
This is exactly the solution we obtained from the linear system in (1). The convergence can also be seen directly from the contraction property, by applying it repeatedly:
\[ \| (\cT^\pi)^n V_1 - (\cT^\pi)^n V_2 \|_\infty \le \gamma^n \|V_1 - V_2\|_\infty \longrightarrow 0, \quad n \to \infty. \]
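As a quick numerical sanity check of the expansion \((\cT^\pi)^n V = \sum_{k=0}^{n-1} \gamma^k (\bP^\pi)^k \br^\pi + \gamma^n (\bP^\pi)^n V\), the sketch below uses a small two-state chain (the matrix \(P\) and reward vector \(r\) are made up for illustration) and compares \(n\) applications of the operator with the closed-form partial sum.
# Hypothetical two-state example (P and r are made up for illustration)
P = matrix(c(0.9, 0.1,
             0.4, 0.6), byrow = TRUE, nrow = 2)
r = c(1, 2)
gamma = 0.8
# apply the Bellman operator n times, starting from an arbitrary V0
n = 5
V0 = c(10, -3)
V = V0
for (k in 1:n) V = r + gamma * P %*% V
# closed-form expansion: sum_{k=0}^{n-1} gamma^k P^k r + gamma^n P^n V0
Pk = diag(2)               # P^0
V_expanded = rep(0, 2)
for (k in 0:(n-1)) {
  V_expanded = V_expanded + gamma^k * Pk %*% r
  Pk = Pk %*% P
}
V_expanded = V_expanded + gamma^n * Pk %*% V0   # Pk is now P^n
# the two columns should agree up to floating-point error
cbind(iterated = as.vector(V), expanded = as.vector(V_expanded))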
The above discussion can be easily extended to stochastic policies. In this case, the Bellman operator becomes a linear combination over all possible actions, weighted by the policy probabilities:
\[ \cT^\pi V(s) = \sum_a \pi(a \mid s) \left[ r(s,a) + \gamma \E_{S' \sim \P(\cdot \mid s,a)} \left[ V(S') \right] \right]. \]
The contraction property can be verified in the same way: the expectation over actions is simply a weighted average, so the same supremum-norm bound applies. Thus, all results for deterministic policies carry over to stochastic policies without change.
When the state and action spaces are continuous, the Bellman operator is defined using integrals instead of sums. For example, if we assume a density function \(p(s' \mid s,a)\) for the transition kernel and a density \(\pi(a \mid s)\) for the stochastic policy, then
\[ \cT^\pi V(s) = \int_a \pi(a \mid s) \left[ r(s,a) + \gamma \int_{s'} p(s' \mid s,a) V(s') ds' \right] da. \]
The contraction property still holds because the integrals are linear operators and the same supremum norm bounds apply.
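As a minimal sketch of this continuous-state case, the code below assumes a hypothetical one-dimensional problem: a transition density \(p(s' \mid s, a)\) equal to a Normal\((0.8\,s + a,\, 1)\) density, reward \(r(s,a) = -s^2\), deterministic policy \(\pi(s) = 0\), and discount \(0.9\), with the integral approximated by a Riemann sum on a truncated grid. None of these choices come from the example that follows.
# Minimal sketch of the Bellman operator with a continuous state space.
# Hypothetical ingredients: deterministic policy pi(s) = 0,
# transition density p(s' | s, a) = Normal(0.8 * s + a, sd = 1),
# reward r(s, a) = -s^2, discount gamma = 0.9.
gamma = 0.9
r_fun = function(s, a) -s^2
# represent a value function by its values on a grid; integrate by a Riemann sum
grid = seq(-10, 10, by = 0.05)
bellman_op = function(V_grid) {
  sapply(grid, function(s) {
    a = 0                                    # deterministic policy pi(s) = 0
    dens = dnorm(grid, mean = 0.8 * s + a, sd = 1)
    EV = sum(dens * V_grid) * 0.05           # approximates integral of p(s'|s,a) V(s') ds'
    r_fun(s, a) + gamma * EV
  })
}
# iterate the operator starting from V = 0; successive iterates converge
V = rep(0, length(grid))
for (k in 1:200) V = bellman_op(V)
V[which.min(abs(grid))]   # approximate V^pi(0)
In practice, the grid spacing, truncation range, and quadrature rule all affect the accuracy of this approximation.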
Let us work through a simple example to illustrate iterative policy evaluation using the Bellman operator. Consider an MDP with three states and two actions. The transition probability matrices for actions 1 and 2 are
\[ P_1 = \begin{pmatrix} 0.8 & 0.1 & 0.1 \\ 0.05 & 0.05 & 0.9 \\ 0.2 & 0.2 & 0.6 \end{pmatrix} \]
\[ P_2 = \begin{pmatrix} 0.5 & 0.25 & 0.25 \\ 0.1 & 0.8 & 0.1 \\ 0.8 & 0.1 & 0.1 \end{pmatrix} \]
The expected immediate rewards \(r(s,a)\), with states as rows and actions as columns, are
\[ r = \begin{pmatrix} 5 & 3 \\ 2 & 2.5 \\ 3 & 2 \end{pmatrix} \]
The following code performs iterative policy evaluation for a given stochastic policy:
# Transition matrices and reward matrix as before
P1 = matrix(c(0.8, 0.1, 0.1,
              0.05, 0.05, 0.9,
              0.2, 0.2, 0.6), byrow = TRUE, nrow = 3)
P2 = matrix(c(0.5, 0.25, 0.25,
              0.1, 0.8, 0.1,
              0.8, 0.1, 0.1), byrow = TRUE, nrow = 3)
# define the expected reward for each state and each action
r = matrix(c(5, 3,
             2, 2.5,
             3, 2), byrow = TRUE, nrow = 3)
gamma = 0.7
# Define a stochastic policy: each row is a state,
# columns give the probabilities of taking action 1 and action 2, respectively
Pi = matrix(c(0.8, 0.2,
              0.3, 0.7,
              0.7, 0.3), byrow = TRUE, nrow = 3)
# Induced reward vector r^pi(s) = sum_a pi(a|s) r(s,a)
r_pi = rowSums(r * Pi)
# Induced transition matrix P^pi(s, s') = sum_a pi(a|s) P_a(s, s')
P_pi = matrix(0, nrow = 3, ncol = 3)
for (s in 1:3) {
  P_pi[s, ] = Pi[s, 1] * P1[s, ] + Pi[s, 2] * P2[s, ]
}
# Iterative policy evaluation: V_{k+1} = r^pi + gamma * P^pi V_k
V = c(0, 0, 0)   # initial value function
niter = 100
for (k in 1:niter)
{
  V_new = as.vector(r_pi + gamma * P_pi %*% V)
  if (k <= 6 | k == niter)
  {
    if (k == niter) cat("\n--------------\n\n")
    cat(paste("Iteration", k, "\n"))
    print(V_new)
  }
  V = V_new
}
## Iteration 1
## [1] 4.60 2.35 2.70
## Iteration 2
## [1] 7.442350 4.212175 5.053750
## Iteration 3
## [1] 9.298336 5.691013 6.772845
## Iteration 4
## [1] 10.550749 6.805821 7.984034
## Iteration 5
## [1] 11.411165 7.617313 8.831363
## Iteration 6
## [1] 12.007813 8.196797 9.423709
##
## --------------
##
## Iteration 100
## [1] 13.390040 9.569872 10.803745
# Exact solution (linear system)
cat("\nExact solution:\n")
##
## Exact solution:
V_exact = solve(diag(3) - gamma * P_pi, r_pi)
print(V_exact)
## [1] 13.390040 9.569872 10.803745
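Consistent with the contraction bound, the sup-norm error of the iterates should shrink by at least a factor of \(\gamma\) per iteration. The short check below reuses r_pi, P_pi, gamma, and V_exact defined above.
# track the sup-norm error of the iterates; by the contraction property,
# each error should be at most gamma times the previous one
V = c(0, 0, 0)
err = rep(NA, 10)
for (k in 1:10) {
  V = as.vector(r_pi + gamma * P_pi %*% V)
  err[k] = max(abs(V - V_exact))
}
# ratios of successive errors; each should be at most gamma = 0.7
round(err[-1] / err[-length(err)], 4)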
So far we have focused on evaluating a given policy (policy evaluation). The next step in the MDP framework is to identify a policy that maximizes long-term return (policy learning). Recall from our earlier discussion on Q-learning that following the optimal policy \(\pi^\ast\) yields the highest possible value at every state:
\[ V^\ast(s) = \sup_{\pi} \, V^\pi(s) = \E\!\left[ R + \gamma V^\ast(S') \,\big|\, S = s, A = \pi^\ast(s) \right]. \]
Since the optimal policy selects the action that maximizes the continuation value, this leads naturally to the Bellman Optimality Equation:
\[ V^\ast(s) = \max_{a} \left\{ r(s,a) + \gamma \E_{S' \sim p(\cdot \mid s,a)} \left[ V^\ast(S') \right] \right\}. \]
This equation defines the Bellman Optimality Operator, denoted \(\cT^\ast\), which acts on any candidate value function \(V\) as
\[ \cT^\ast V(s) = \max_{a} \left\{ r(s,a) + \gamma \E_{S' \sim p(\cdot \mid s,a)} \left[ V(S') \right] \right\}. \]
The true optimal value function \(V^\ast\) is the unique fixed point of this operator: \[ \cT^\ast V^\ast = V^\ast. \]
Naturally, the optimal policy can be extracted as the greedy policy with respect to \(V^\ast\):
\[ \pi^\ast(s) = \arg\max_a \left\{ r(s,a) + \gamma \E_{S' \sim p(\cdot \mid s,a)} \left[ V^\ast(S') \right] \right\}. \]
We can also check that, although \(\cT^\ast\) involves a maximization, it retains the same contraction behavior as the policy-specific Bellman operator. For any two bounded value functions \(V_1\) and \(V_2\),
\[ \begin{aligned} \big\| \cT^\ast V_1 - \cT^\ast V_2 \big\|_\infty &= \sup_s \left| \max_a \left\{ r(s,a) + \gamma \E[V_1(S') \mid s,a] \right\} - \max_a \left\{ r(s,a) + \gamma \E[V_2(S') \mid s,a] \right\} \right| \\ &\le \sup_s \max_a \left| \gamma \E[V_1(S') - V_2(S') \mid s,a] \right| \\ &= \gamma \sup_s \max_a \int_{s'} | V_1(s') - V_2(s') | \,\, p(s' \mid s,a) ds' \\ &\le \gamma \sup_s \max_a \int_{s'} \| V_1 - V_2 \|_\infty p(s' \mid s,a) ds' \\ &= \gamma \| V_1 - V_2 \|_\infty. \end{aligned} \]
where the first inequality uses the fact that the difference of two maxima is bounded by the maximum of their differences², and the subsequent steps use standard supremum-norm bounds. Thus \(\cT^\ast\) is a \(\gamma\)-contraction under the supremum norm. By Banach’s Fixed-point Theorem, the fixed point \(V^\ast\) exists, is unique, and can be obtained by repeatedly applying \(\cT^\ast\):
\[ V_\text{new} = \cT^\ast V_\text{old}. \]
Regardless of the initial guess \(V_0\), the sequence of iterates \((V_k)\) converges to the unique fixed point \(V^\ast\), and the optimal policy is then extracted as the greedy policy with respect to \(V^\ast\), defined previously. This iterative procedure for finding the optimal value function is known as Value Iteration.
In the following, we illustrate how to apply the Bellman Optimality operator to the example MDP developed earlier. Since the transition probabilities and expected rewards are assumed known, this setting allows us to perform exact value iteration and observe the emergence of the optimal policy. We use the same transition matrices and reward structure as before. The example was designed so that the policy that is greedy with respect to the immediate rewards alone differs from the optimal policy, as the first few iterations below show.
# Define the transition probability matrices and rewards
P1 = matrix(c(0.8, 0.1, 0.1,
              0.05, 0.05, 0.9,
              0.2, 0.2, 0.6), byrow = TRUE, nrow = 3)
P2 = matrix(c(0.5, 0.25, 0.25,
              0.1, 0.8, 0.1,
              0.8, 0.1, 0.1), byrow = TRUE, nrow = 3)
# define the expected reward for each state and each action
r = matrix(c(5, 3,
             2, 2.5,
             3, 2), byrow = TRUE, nrow = 3)
gamma = 0.7
# start with some arbitrary value function
V = c(0, 0, 0)
# recursively update the value function using the Bellman Optimality operator,
# while recording the greedy policy at each iteration
niter = 20
for (k in 1:niter)
{
  # calculate the Q-function for each state and each action
  Q = matrix(NA, nrow = 3, ncol = 2)
  Q[, 1] = r[, 1] + gamma * P1 %*% V
  Q[, 2] = r[, 2] + gamma * P2 %*% V
  # update the value function using the Bellman Optimality operator
  V = apply(Q, 1, max)
  SummaryMat = cbind("Policy" = apply(Q, 1, which.max),
                     "Value" = V)
  if (k <= 4 | k == niter)
  {
    if (k == niter) cat("\n--------------\n\n")
    cat(paste("Iteration", k, "\n"))
    print(SummaryMat)
  }
}
## Iteration 1
## Policy Value
## [1,] 1 5.0
## [2,] 2 2.5
## [3,] 1 3.0
## Iteration 2
## Policy Value
## [1,] 1 8.185
## [2,] 2 4.460
## [3,] 1 5.310
## Iteration 3
## Policy Value
## [1,] 1 10.26750
## [2,] 2 5.94225
## [3,] 2 7.26750
## Iteration 4
## Policy Value
## [1,] 1 11.674482
## [2,] 1 7.145866
## [3,] 2 8.674482
##
## --------------
##
## Iteration 20
## Policy Value
## [1,] 1 14.90083
## [2,] 1 10.37910
## [3,] 2 11.90083
This is clearly better than the previously defined stochastic policy in terms of the value function. In fact, the optimal policy is already identified after just four iterations, although the value function itself takes longer to converge. We can validate the value of this policy using Monte Carlo simulation:
# write a function to simulate the MDP
simulate_MDP <- function(n, policy, P1, P2, r, gamma, s0){
  # prepare the data
  S = rep(NA, n)
  A = rep(NA, n)
  R = rep(NA, n)
  # start from the given initial state
  S[1] = s0
  # sequential simulation
  for (t in 1:n){
    # sample the action A_t from the policy
    A[t] = sample(1:2, 1, prob = policy[S[t], ])
    # get the reward r(S_t, A_t) with a bit of noise
    R[t] = r[S[t], A[t]] + rnorm(1) * 0.1
    if (t < n){
      # get the next state depending on the action
      if (A[t] == 1){
        S[t + 1] = sample(1:3, 1, prob = P1[S[t], ])
      } else {
        S[t + 1] = sample(1:3, 1, prob = P2[S[t], ])
      }
    }
  }
  # return the trajectory and the cumulative discounted reward
  return(list(S = S, A = A, R = R, cum_reward = sum(gamma^seq(0, n-1) %*% R)))
}
# Simulate the data
# generate 1000 simulated trajectories from each starting state
nsim = 1000
cum_reward = matrix(NA, nsim, 3)
set.seed(1)
# the optimal policy: action 1 in states 1 and 2, action 2 in state 3
mypolicy = matrix(c(1, 0,
                    1, 0,
                    0, 1), byrow = TRUE, nrow = 3)
nchain = 100
for (k in 1:nsim)
{
  # record the cumulative discounted reward from each starting state
  cum_reward[k, 1] = simulate_MDP(nchain, mypolicy, P1, P2, r, gamma, 1)$cum_reward
  cum_reward[k, 2] = simulate_MDP(nchain, mypolicy, P1, P2, r, gamma, 2)$cum_reward
  cum_reward[k, 3] = simulate_MDP(nchain, mypolicy, P1, P2, r, gamma, 3)$cum_reward
}
# simulated mean value function
colMeans(cum_reward)
## [1] 14.79509 10.41573 11.87744
The simulated means are close to the values obtained from value iteration, as expected. Similarly to the value-function case, we can also derive the Bellman operator and the Bellman optimality operator for the Q-function. Recalling the Bellman equation for the Q-function, we define the Bellman operator under a fixed policy \(\pi\) as:
\[ \cT_Q^\pi Q(s,a) = r(s,a) + \gamma \E_{S' \sim p(\cdot \mid s,a)} \left[ \sum_{a'} \pi(a' \mid S') Q(S', a') \right]. \]
Under the supremum norm defined as \(\| Q \|_\infty = \sup_{s,a} |Q(s,a)|\), it is easy to see that this operator is also a contraction mapping. For any state-action pair \((s,a)\),
\[ \begin{aligned} &\big| (\cT_Q^\pi Q_1)(s,a) - (\cT_Q^\pi Q_2)(s,a) \big| \\ =& \, \gamma \, \left| \E\!\left[ Q_1(S',A') - Q_2(S',A') \,\big|\, S=s, A=a \right] \right| \\ \le& \, \gamma \, \E\!\left[ | Q_1(S',A') - Q_2(S',A') | \,\big|\, S=s, A=a \right] \\ \le& \, \gamma \, \E\!\left[ \| Q_1 - Q_2 \|_\infty \,\big|\, S=s, A=a \right] \\ =& \, \gamma \, \| Q_1 - Q_2 \|_\infty. \end{aligned} \]
Here the expectation is taken over both \(S' \sim \P(\cdot \mid s,a)\) and \(A' \sim \pi(\cdot \mid S')\). Taking the supremum over \((s,a)\), we have
\[ \| \cT_Q^\pi Q_1 - \cT_Q^\pi Q_2 \|_\infty \le \gamma \| Q_1 - Q_2 \|_\infty. \]
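As a short sketch, we can evaluate the Q-function of the stochastic policy from the earlier example by repeatedly applying \(\cT_Q^\pi\) (this reuses P1, P2, r, Pi, and gamma defined above). Since \(V^\pi(s) = \sum_a \pi(a \mid s)\, Q^\pi(s,a)\), weighting the converged Q-values by the policy probabilities should reproduce the \(V^\pi\) computed earlier.
# Iterative evaluation of Q^pi using the Bellman operator T_Q^pi
# (reuses P1, P2, r, Pi, gamma from the earlier example)
Q = matrix(0, nrow = 3, ncol = 2)   # initial Q-function
for (k in 1:100) {
  # expected continuation value under pi: sum_{a'} pi(a'|s') Q(s', a')
  V_pi = rowSums(Pi * Q)
  Q = cbind(r[, 1] + gamma * P1 %*% V_pi,
            r[, 2] + gamma * P2 %*% V_pi)
}
print(Q)
# consistency check: sum_a pi(a|s) Q^pi(s,a) should recover V^pi from before
rowSums(Pi * Q)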
Analogous to the value-function case, define the optimal Bellman operator on Q-functions as
\[ \cT_Q^\ast Q(s,a) = r(s,a) + \gamma \E_{S' \sim p(\cdot \mid s,a)} \big[ \max_{a'} Q(S', a') \big], \]
with the optimal policy extracted as the greedy policy with respect to \(Q^\ast\):
\[ \pi^\ast(s) \in \arg\max_a Q^\ast(s,a). \]
The contraction property can be verified in a similar way:
\[ \begin{aligned} &\big| (\cT_Q^\ast Q_1)(s,a) - (\cT_Q^\ast Q_2)(s,a) \big| \\ =& \, \gamma \, \left| \E\!\left[ \max_{a'} Q_1(S',a') - \max_{a'} Q_2(S',a') \,\big|\, S=s, A=a \right] \right| \\ \le& \, \gamma \, \E\!\left[ \big| \max_{a'} Q_1(S',a') - \max_{a'} Q_2(S',a') \big| \,\big|\, S=s, A=a \right] \\ \le& \, \gamma \, \E\!\left[ \max_{a'} | Q_1(S',a') - Q_2(S',a') | \,\big|\, S=s, A=a \right] \\ \le& \, \gamma \, \sup_{s',a'} | Q_1(s',a') - Q_2(s',a') | \\ =& \, \gamma \, \| Q_1 - Q_2 \|_\infty. \end{aligned} \]
Therefore
\[ \| \cT_Q^\ast Q_1 - \cT_Q^\ast Q_2 \|_\infty \le \gamma \| Q_1 - Q_2 \|_\infty, \]
so \(\cT_Q^\ast\) is likewise a \(\gamma\)-contraction under the supremum norm, and repeatedly applying it converges to the unique fixed point \(Q^\ast\).
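As a final sketch, Q-value iteration applies \(\cT_Q^\ast\) repeatedly (again reusing P1, P2, r, and gamma from the example). At convergence, the row-wise maxima of \(Q\) should agree with the optimal values found by value iteration above, and the row-wise argmax recovers the greedy optimal policy.
# Q-value iteration with the optimal Bellman operator T_Q^*
# (reuses P1, P2, r, gamma from the earlier example)
Q = matrix(0, nrow = 3, ncol = 2)   # initial Q-function
for (k in 1:100) {
  Vmax = apply(Q, 1, max)           # max_{a'} Q(s', a')
  Q = cbind(r[, 1] + gamma * P1 %*% Vmax,
            r[, 2] + gamma * P2 %*% Vmax)
}
apply(Q, 1, max)        # optimal value function V^*
apply(Q, 1, which.max)  # greedy optimal policy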
1. The infinity norm of a matrix \(A\) is defined as \(\|A\|_\infty = \max_i \sum_j |a_{ij}|\), which for a stochastic matrix equals 1. Hence, repeated multiplication by \(\bP^\pi\) never increases the supremum norm of a vector, and the factor \(\gamma^n\) with \(\gamma < 1\) drives the remainder term to zero.
2. This inequality follows from the fact that for any two sets of real numbers \(\{x_a\}\) and \(\{y_a\}\), \(|\max_a x_a - \max_a y_a| \le \max_a |x_a - y_a|\).