Our previous lecture introduced the Markov Decision Process (MDP) and the Bellman Equation. There, we used the Bellman Equation to evaluate a policy by analytically solving the equation
\[ \bv^\pi = \br^\pi + \gamma P^\pi \bv^\pi, \tag{1} \]
where \(\bv^\pi\) is a vector of value functions for all possible states, \(\br^\pi\) is a vector of expected rewards, \(P^\pi\) is the transition matrix, and \(\gamma\) is the discount factor. However, calculating all the necessary components of this equation is not always feasible. In this lecture, we will introduce the Contraction Mapping Theorem, which provides a way to iteratively update the estimated value function until convergence.
Previously, we established a property of the value function known as the Bellman equation:
\[ \begin{aligned} \cV^\pi(s) &= \E_\pi \left[ R + \gamma \cV^\pi(S') | S = s \right] \\ &= r(s, a = \pi(s)) + \gamma \E_{s' \sim p(s' | s, a = \pi(s))} \left[ \cV^\pi(s') \right]. \end{aligned} \]
Here, we assumed a deterministic policy \(\pi\). If we are interested in a stochastic policy, a further expectation is needed to integrate out the randomness in the policy. We can view the right-hand side as an operator applied to the value function:
\[ \cT^\pi \cV(s) = r(s, a = \pi(s)) + \gamma \E_{s' \sim p(s' | s, a = \pi(s))} \left[ \cV(s') \right]. \tag{2} \]
Note that the above operator can be applied to any value function, but when the value function is \(\cV^\pi\), the operator reaches a fixed point, which can be written as
\[ \cT^\pi \cV^\pi = \cV^\pi. \]
The contraction mapping theorem justifies the existence and uniqueness of this fixed point.
Theorem 1 (Banach Fixed-point Theorem) Let \(\cX\) be a complete metric space, equipped with a distance metric \(d\), and \(\cT: \cX \rightarrow \cX\) be a contraction mapping, i.e., there exists a constant \(0 \leq \gamma < 1\) such that
\[ d(\cT(x_1), \cT(x_2)) \leq \gamma \, d(x_1, x_2), \quad \forall x_1, x_2 \in \cX. \]
Then, there exists a unique fixed point \(x^*\) such that
\[ \cT(x^*) = x^*. \] \(\square\)
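For intuition, consider a toy numerical example (not part of the lecture's formal development, just an illustration): the map \(\cT(x) = \gamma x + c\) on the real line satisfies \(|\cT(x_1) - \cT(x_2)| = \gamma |x_1 - x_2|\), so it is a \(\gamma\)-contraction, and iterating it from any starting point converges to its unique fixed point \(c / (1 - \gamma)\).

# a toy contraction on the real line: T(x) = gamma * x + c0, with fixed point c0 / (1 - gamma)
gamma = 0.7
c0 = 3
x = 100                               # arbitrary starting point
for (k in 1:50) x = gamma * x + c0    # repeatedly apply the contraction
c("iterated" = x, "fixed point" = c0 / (1 - gamma))

Both entries should be approximately 10 here, regardless of the starting point.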
The Bellman operator is a contraction mapping. To see this, let’s consider two value functions \(\cV_1\) and \(\cV_2\), and use the infinity norm \(\| \cdot \|_\infty\) as the distance metric. If we consider a deterministic policy \(a = \pi(s)\), then we have
\[ \begin{aligned} \| \cT^\pi \cV_1 - \cT^\pi \cV_2 \|_\infty &= \sup_s \big| \cT^\pi \cV_1(s) - \cT^\pi \cV_2(s) \big| \\ &= \gamma \sup_s \Big| \E_{s' \sim p(s' | s, a = \pi(s))} \big[ \cV_1(s') - \cV_2(s') \big] \Big| \\ &\leq \gamma \sup_s \int_{s'} \big| \cV_1(s') - \cV_2(s') \big| \, p(s' | s, a = \pi(s)) \, ds' \qquad \text{(by Jensen's Inequality)} \\ &\leq \gamma \big\| \cV_1 - \cV_2 \big\|_\infty. \qquad \text{($p(\cdot)$ integrates to 1)} \end{aligned} \]
The integral can be replaced by a finite sum if the state space is discrete. In either case, the Bellman operator is a contraction mapping, so the Banach Fixed-point Theorem guarantees the existence and uniqueness of its fixed point. When there is only a finite number of states, the solution we introduced in the previous lecture is exactly this fixed point, and computing it performs policy evaluation:
\[ \bv^\pi = (I - \gamma P^\pi)^{-1} \br^\pi. \]
However, this may not always be the best approach since there can be a large number of state and action pairs, making it difficult to accurately estimate the transition matrix \(P^\pi\) and the expected rewards \(\br^\pi\). In fact, recursively updating the value function using the Bellman operator will give the same result1. This can be seen by recursively applying the contraction property of the Bellman operator:
\[ \| (\cT^\pi)^n \cV_1 - (\cT^\pi)^n \cV_2 \|_\infty \leq \gamma^n \big\| \cV_1 - \cV_2 \big\|_\infty \rightarrow 0, \quad \text{as } n \rightarrow \infty. \]
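As a small numerical sketch of this point (the two-state transition matrix Ppi, reward vector rpi, and discount factor below are hypothetical, made up only for illustration), we can compare the analytic fixed point \((\bI - \gamma P^\pi)^{-1} \br^\pi\) with the result of repeatedly applying the Bellman operator \(\cT^\pi\):

# a minimal sketch of iterative policy evaluation under a fixed policy
# (the two-state Ppi and rpi below are hypothetical, for illustration only)
Ppi = matrix(c(0.9, 0.1,
               0.2, 0.8), byrow=TRUE, nrow=2)
rpi = c(1, 2)
gamma = 0.7
# analytic fixed point: solve (I - gamma * Ppi) v = rpi
v_exact = solve(diag(2) - gamma * Ppi, rpi)
# iterative solution: start from an arbitrary value and apply the Bellman operator
v = c(0, 0)
for (k in 1:100)
{
  v = rpi + gamma * c(Ppi %*% v)
}
cbind("analytic" = v_exact, "iterative" = v)

After enough iterations, the two columns should agree up to numerical precision.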
So far we have not discussed how to obtain the optimal policy in the MDP setting. But we did have an earlier observation from the batch Q-learning lecture that if we roll out the optimal policy, the value function is maximized, i.e.,
\[ \cV^\ast(s) = \E_{\pi^\ast} \left[ R + \gamma \cV^\ast(S') | S = s \right]. \]
Note that the optimal policy \(\pi^\ast\) selects the action that maximizes the value function. This leads to the Bellman Optimality Operator:
\[ \cT^\ast \cV(s) = \max_a \left[ r(s, a) + \gamma \E_{s' \sim p(s' | s, a)} \left[ \cV(s') \right] \right]. \]
The optimal value function \(\cV^\ast\) is the fixed point of this operator, i.e., \(\cT^\ast \cV^\ast = \cV^\ast\), which is the Bellman Optimality Equation.
We can also show that this operator is a \(\gamma\)-contraction with respect to the supremum norm:
\[ \begin{aligned} \| \cT^\ast \cV_1 - \cT^\ast \cV_2 \|_\infty &= \sup_s \big| \cT^\ast \cV_1(s) - \cT^\ast \cV_2(s) \big| \\ &\leq \gamma \sup_s \max_a \Big| \E_{s' \sim p(s'| s, a)} \big[ \cV_1(s') - \cV_2(s') \big] \Big| \\ &\leq \gamma \sup_s \max_a \int_{s'} \big| \cV_1(s') - \cV_2(s') \big| \, p(s' | s, a) \, ds' \\ &\leq \gamma \big\| \cV_1 - \cV_2 \big\|_\infty. \end{aligned} \]
Here the first inequality uses the fact that \(| \max_a f(a) - \max_a g(a) | \leq \max_a | f(a) - g(a) |\), and the remaining steps follow the same argument as before.
In this case, we cannot analytically solve the Bellman Optimality Equation, but we can still iteratively update the value function until convergence. Following this idea, we can set up a small example to perform policy learning, assuming that all the transition probabilities and expected rewards are known.
The above understanding of the Bellman Optimality Equation immediately leads to the Value Iteration algorithm. The idea is to start with some arbitrary value function and iteratively update it using the Bellman Optimality Operator until convergence. Note that the formulation requires knowing the transition probabilities and expected rewards; although these could be estimated with sample averages, we will simply assume they are known. Furthermore, applying the Bellman Optimality Operator requires maximizing over all possible actions, which amounts to calculating the Q function for each action and picking the largest one. We will design our example with three states and two actions. The transition matrices under actions 1 and 2 are
\[ P_1 = \begin{pmatrix} 0.8 & 0.1 & 0.1 \\ 0.05 & 0.05 & 0.9 \\ 0.8 & 0.1 & 0.1 \end{pmatrix} \]
\[ P_2 = \begin{pmatrix} 0.5 & 0.25 & 0.25 \\ 0.1 & 0.8 & 0.1 \\ 0.2 & 0.2 & 0.6 \end{pmatrix} \]
Our immediate reward function, with rows indexing states and columns indexing actions, is
\[ r = \begin{pmatrix} 5 & 3 \\ 1.6 & 3 \\ 4 & 2 \end{pmatrix} \]
We can see that picking action 1 gives a better immediate reward in states 1 and 3, but not in state 2. However, based on our transition kernel design, picking action 1 in state 2 allows us to transition into state 3, and then back to state 1 more quickly, which eventually attains a better reward.
# Define the transition probability matrices and rewards
P1 = matrix(c(0.8, 0.1, 0.1,
              0.05, 0.05, 0.9,
              0.8, 0.1, 0.1), byrow=TRUE, nrow=3)
P2 = matrix(c(0.5, 0.25, 0.25,
              0.1, 0.8, 0.1,
              0.2, 0.2, 0.6), byrow=TRUE, nrow=3)

# define the expected reward for each state (row) and each action (column)
r = matrix(c(5, 3,
             1.6, 3,
             4, 2), byrow=TRUE, nrow=3)

gamma = 0.7

# start with some arbitrary value function
V = c(0, 0, 0)

# recursively update the value function using the Bellman Optimality operator
# while recording the greedy policy at each iteration
niter = 100

for (k in 1:niter)
{
  # calculate the Q function for each state and each action
  Q = matrix(NA, nrow=3, ncol=2)
  Q[, 1] = r[, 1] + gamma * P1 %*% V
  Q[, 2] = r[, 2] + gamma * P2 %*% V

  # update the value function using the Bellman Optimality operator
  V = apply(Q, 1, max)

  # record the greedy policy and the current value function
  SummaryMat = cbind("Policy" = apply(Q, 1, which.max),
                     "Value" = V)

  # print the first few iterations and the last one
  if (k <= 6 | k == niter)
  {
    if (k == niter) cat("\n--------------\n\n")
    cat(paste("Iteration", k, "\n"))
    print(SummaryMat)
  }
}
## Iteration 1
## Policy Value
## [1,] 1 5
## [2,] 2 3
## [3,] 1 4
## Iteration 2
## Policy Value
## [1,] 1 8.29
## [2,] 2 5.31
## [3,] 1 7.29
## Iteration 3
## Policy Value
## [1,] 1 10.5244
## [2,] 2 7.0642
## [3,] 1 9.5244
## Iteration 4
## Policy Value
## [1,] 1 12.054866
## [2,] 2 8.359368
## [3,] 1 11.054866
## Iteration 5
## Policy Value
## [1,] 1 13.109721
## [2,] 2 9.298927
## [3,] 1 12.109721
## Iteration 6
## Policy Value
## [1,] 1 13.84005
## [2,] 1 10.01343
## [3,] 1 12.84005
##
## --------------
##
## Iteration 100
## Policy Value
## [1,] 1 15.54058
## [2,] 1 11.71449
## [3,] 1 14.54058
In fact, the optimal policy is identified after just a few iterations, although the value function takes longer to converge. We can validate the policy value using Monte Carlo simulation:
# write a function to simulate the MDP under a given policy
simulate_MDP <- function(n, policy, P1, P2, r, gamma, s0){
  # prepare the data
  S = rep(NA, n)
  A = rep(NA, n)
  R = rep(NA, n)

  # start from the initial state s0
  S[1] = s0

  # sequential simulation
  for (t in 1:n){
    # take an action A_t based on the policy
    A[t] = sample(1:2, 1, prob = policy[S[t],])

    # get the reward r(S_t, A_t) with a bit of noise
    R[t] = r[S[t], A[t]] + rnorm(1)*0.1

    if (t < n){
      # get the next state depending on the action
      if (A[t] == 1){
        S[t+1] = sample(1:3, 1, prob = P1[S[t],])
      } else {
        S[t+1] = sample(1:3, 1, prob = P2[S[t],])
      }
    }
  }

  # return the trajectory and the cumulative discounted reward
  return(list(S = S, A = A, R = R, cum_reward = sum(gamma^seq(0, n-1) %*% R)))
}
# Simulate the data: generate nsim trajectories from each initial state
nsim = 1000
cum_reward = matrix(NA, nsim, 3)
set.seed(1)

# the learned optimal policy is to pick action 1 all the time
mypolicy = matrix(c(1, 0,
                    1, 0,
                    1, 0), byrow=TRUE, nrow=3)
nchain = 100

for (k in 1:nsim)
{
  # record the cumulative discounted reward starting from each state
  cum_reward[k, 1] = simulate_MDP(nchain, mypolicy, P1, P2, r, gamma, 1)$cum_reward
  cum_reward[k, 2] = simulate_MDP(nchain, mypolicy, P1, P2, r, gamma, 2)$cum_reward
  cum_reward[k, 3] = simulate_MDP(nchain, mypolicy, P1, P2, r, gamma, 3)$cum_reward
}
# simulated mean value function
colMeans(cum_reward)
## [1] 15.47050 11.72851 14.51275
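As one more sanity check (reusing the P1, r, and gamma defined in the code above), the learned policy always takes action 1, so its value can also be computed analytically as \((\bI - \gamma P_1)^{-1} \br^\pi\), where \(\br^\pi\) is the first column of the reward matrix. The result should agree with both the Monte Carlo estimates and the converged values from value iteration.

# analytic value of the "always take action 1" policy: solve (I - gamma * P1) v = r[, 1]
solve(diag(3) - gamma * P1, r[, 1])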
We can iteratively apply the operator to any arbitrary starting value \(\cV = x\), i.e., \((\cT^\pi)^n x = \sum_{k=0}^{n-1} (\gamma P^\pi)^k \br^\pi + (\gamma P^\pi)^n x\). This can be re-organized into \((\bI - (\gamma P^\pi)^n ) (\bI - \gamma P^\pi)^{-1} \br^\pi + (\gamma P^\pi)^n x\), and since \((\gamma P^\pi)^n \rightarrow 0\), we obtain the same solution.↩︎