\(\newcommand{\ci}{\perp\!\!\!\perp}\) \(\newcommand{\cA}{\mathcal{A}}\) \(\newcommand{\cB}{\mathcal{B}}\) \(\newcommand{\cC}{\mathcal{C}}\) \(\newcommand{\cD}{\mathcal{D}}\) \(\newcommand{\cE}{\mathcal{E}}\) \(\newcommand{\cF}{\mathcal{F}}\) \(\newcommand{\cG}{\mathcal{G}}\) \(\newcommand{\cH}{\mathcal{H}}\) \(\newcommand{\cI}{\mathcal{I}}\) \(\newcommand{\cJ}{\mathcal{J}}\) \(\newcommand{\cK}{\mathcal{K}}\) \(\newcommand{\cL}{\mathcal{L}}\) \(\newcommand{\cM}{\mathcal{M}}\) \(\newcommand{\cN}{\mathcal{N}}\) \(\newcommand{\cO}{\mathcal{O}}\) \(\newcommand{\cP}{\mathcal{P}}\) \(\newcommand{\cQ}{\mathcal{Q}}\) \(\newcommand{\cR}{\mathcal{R}}\) \(\newcommand{\cS}{\mathcal{S}}\) \(\newcommand{\cT}{\mathcal{T}}\) \(\newcommand{\cU}{\mathcal{U}}\) \(\newcommand{\cV}{\mathcal{V}}\) \(\newcommand{\cW}{\mathcal{W}}\) \(\newcommand{\cX}{\mathcal{X}}\) \(\newcommand{\cY}{\mathcal{Y}}\) \(\newcommand{\cZ}{\mathcal{Z}}\) \(\newcommand{\bA}{\mathbf{A}}\) \(\newcommand{\bB}{\mathbf{B}}\) \(\newcommand{\bC}{\mathbf{C}}\) \(\newcommand{\bD}{\mathbf{D}}\) \(\newcommand{\bE}{\mathbf{E}}\) \(\newcommand{\bF}{\mathbf{F}}\) \(\newcommand{\bG}{\mathbf{G}}\) \(\newcommand{\bH}{\mathbf{H}}\) \(\newcommand{\bI}{\mathbf{I}}\) \(\newcommand{\bJ}{\mathbf{J}}\) \(\newcommand{\bK}{\mathbf{K}}\) \(\newcommand{\bL}{\mathbf{L}}\) \(\newcommand{\bM}{\mathbf{M}}\) \(\newcommand{\bN}{\mathbf{N}}\) \(\newcommand{\bO}{\mathbf{O}}\) \(\newcommand{\bP}{\mathbf{P}}\) \(\newcommand{\bQ}{\mathbf{Q}}\) \(\newcommand{\bR}{\mathbf{R}}\) \(\newcommand{\bS}{\mathbf{S}}\) \(\newcommand{\bT}{\mathbf{T}}\) \(\newcommand{\bU}{\mathbf{U}}\) \(\newcommand{\bV}{\mathbf{V}}\) \(\newcommand{\bW}{\mathbf{W}}\) \(\newcommand{\bX}{\mathbf{X}}\) \(\newcommand{\bY}{\mathbf{Y}}\) \(\newcommand{\bZ}{\mathbf{Z}}\) \(\newcommand{\ba}{\mathbf{a}}\) \(\newcommand{\bb}{\mathbf{b}}\) \(\newcommand{\bc}{\mathbf{c}}\) \(\newcommand{\bd}{\mathbf{d}}\) \(\newcommand{\be}{\mathbf{e}}\) \(\newcommand{\bg}{\mathbf{g}}\) \(\newcommand{\bh}{\mathbf{h}}\) \(\newcommand{\bi}{\mathbf{i}}\) \(\newcommand{\bj}{\mathbf{j}}\) \(\newcommand{\bk}{\mathbf{k}}\) \(\newcommand{\bl}{\mathbf{l}}\) \(\newcommand{\bm}{\mathbf{m}}\) \(\newcommand{\bn}{\mathbf{n}}\) \(\newcommand{\bo}{\mathbf{o}}\) \(\newcommand{\bp}{\mathbf{p}}\) \(\newcommand{\bq}{\mathbf{q}}\) \(\newcommand{\br}{\mathbf{r}}\) \(\newcommand{\bs}{\mathbf{s}}\) \(\newcommand{\bt}{\mathbf{t}}\) \(\newcommand{\bu}{\mathbf{u}}\) \(\newcommand{\bv}{\mathbf{v}}\) \(\newcommand{\bw}{\mathbf{w}}\) \(\newcommand{\bx}{\mathbf{x}}\) \(\newcommand{\by}{\mathbf{y}}\) \(\newcommand{\bz}{\mathbf{z}}\) \(\newcommand{\balpha}{\boldsymbol{\alpha}}\) \(\newcommand{\bbeta}{\boldsymbol{\beta}}\) \(\newcommand{\btheta}{\boldsymbol{\theta}}\) \(\newcommand{\hpi}{\widehat{\pi}}\) \(\newcommand{\bpi}{\boldsymbol{\pi}}\) \(\newcommand{\hbpi}{\widehat{\boldsymbol{\pi}}}\) \(\newcommand{\bxi}{\boldsymbol{\xi}}\) \(\newcommand{\bmu}{\boldsymbol{\mu}}\) \(\newcommand{\bepsilon}{\boldsymbol{\epsilon}}\) \(\newcommand{\bzero}{\mathbf{0}}\) \(\newcommand{\T}{\text{T}}\) \(\newcommand{\Trace}{\text{Trace}}\) \(\newcommand{\Cov}{\text{Cov}}\) \(\newcommand{\Var}{\text{Var}}\) \(\newcommand{\E}{\mathbb{E}}\) \(\newcommand{\Pr}{\text{Pr}}\) \(\newcommand{\pr}{\text{pr}}\) \(\newcommand{\pdf}{\text{pdf}}\) \(\newcommand{\P}{\text{P}}\) \(\newcommand{\p}{\text{p}}\) \(\newcommand{\One}{\mathbf{1}}\) \(\newcommand{\argmin}{\operatorname*{arg\,min}}\) \(\newcommand{\argmax}{\operatorname*{arg\,max}}\)

Overview

Our previous lecture introduced the Markov Decision Process (MDP) and the Bellman equation. There, we used the Bellman equation to evaluate a policy by analytically solving the equation

\[ \bv^\pi = \br^\pi + \gamma P^\pi \bv^\pi, \tag{1} \]

where \(\bv^\pi\) is the vector of value functions over all possible states, \(\br^\pi\) is the vector of expected rewards, \(P^\pi\) is the transition matrix, and \(\gamma\) is the discount factor. However, computing all of the components in this equation is not always feasible. In this lecture, we introduce the Contraction Mapping Theorem, which justifies iteratively updating an estimated value function until convergence.

Bellman Operator

Previously, we established a property of the value function known as the Bellman equation:

\[ \begin{aligned} \cV^\pi(s) &= \E_\pi \left[ R + \gamma \cV^\pi(S') | S = s \right] \\ &= r(s, a = \pi(s)) + \gamma \E_{s' \sim p(s' | s, a = \pi(s))} \left[ \cV^\pi(s') \right]. \end{aligned} \]

Here, we assumed a deterministic policy \(\pi\). If we are interested in a stochastic policy, a further expectation is needed to integrate out the randomness in the policy. We can view the right-hand side as an operator applied to the value function:

\[ \cT^\pi \cV(s) = r(s, a = \pi(s)) + \gamma \E_{s' \sim p(s' | s, a = \pi(s))} \left[ \cV(s') \right]. \tag{2} \]

Note that the above operator can be applied to any value function. However, when the value function is \(\cV^\pi\), the operator reaches a fixed point, which can be written as

\[ \cT^\pi \cV^\pi = \cV^\pi. \]

The contraction mapping theorem justifies the existence and uniqueness of this fixed point.
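For a finite state space, the operator in Equation 2 reduces to the matrix form of Equation 1. Below is a minimal sketch of one application of this operator in R; the reward vector r_pi and transition matrix P_pi (the quantities \(\br^\pi\) and \(P^\pi\)) are hypothetical inputs that the user must supply.

  # one application of the Bellman operator for a finite state space
  # V:     current value function (one entry per state)
  # r_pi:  expected rewards under policy pi (one entry per state)
  # P_pi:  transition matrix under policy pi (rows sum to one)
  # gamma: discount factor
  bellman_operator <- function(V, r_pi, P_pi, gamma)
  {
    as.vector(r_pi + gamma * P_pi %*% V)
  }

Repeatedly applying this map to any starting value converges to \(\cV^\pi\), as formalized by the following theorem.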

Banach’s fixed-point theorem

Theorem 1 (Banach Fixed-point Theorem) Let \(\cX\) be a complete metric space, equipped with a distance metric \(d\), and \(\cT: \cX \rightarrow \cX\) be a contraction mapping, i.e., there exists a constant \(0 \leq \gamma < 1\) such that

\[ d(\cT(x_1), \cT(x_2)) \leq \gamma \, d(x_1, x_2), \quad \forall x_1, x_2 \in \cX. \]

Then, there exists a unique fixed point \(x^*\) such that

\[ \cT(x^*) = x^*. \] \(\square\)
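As a quick standalone illustration of the theorem (unrelated to the MDP setup), the map \(T(x) = 0.5x + 1\) on the real line is a contraction with constant \(0.5\), so iterating it from any starting point converges to its unique fixed point \(x^\ast = 2\):

  # illustrate Banach's fixed-point theorem with the contraction T(x) = 0.5*x + 1
  Tmap <- function(x) 0.5 * x + 1

  x = 10                    # arbitrary starting point
  for (k in 1:30) x = Tmap(x)
  x                         # approximately 2, the unique fixed point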

Contraction of the Bellman Operator

The Bellman operator is a contraction mapping. To see this, let’s consider two value functions \(\cV_1\) and \(\cV_2\), and use the infinity norm \(\| \cdot \|_\infty\) as the distance metric. If we consider a deterministic policy \(a = \pi(s)\), then we have

\[ \begin{aligned} \| \cT^\pi \cV_1 - \cT^\pi \cV_2 \|_\infty &= \sup_s \left| \cT^\pi \cV_1(s) - \cT^\pi \cV_2(s) \right| \\ &= \gamma \sup_s \left| \E_{s' \sim p(s' | s, a = \pi(s))} \Big[ \cV_1(s') - \cV_2(s') \Big] \right| \\ &\leq \gamma \sup_s \int_{s'} \Big| \cV_1(s') - \cV_2(s') \Big| \, p(s' | s, a = \pi(s)) \, ds' \qquad \text{(by Jensen's inequality)} \\ &\leq \gamma \big\| \cV_1 - \cV_2 \big\|_\infty. \qquad \text{($p(\cdot)$ is a density)} \\ \end{aligned} \]

The integral can be replaced by a finite sum when the state space is discrete. In either case, the Bellman operator is a contraction mapping, so the Banach Fixed-point Theorem guarantees the existence and uniqueness of its fixed point. When there are only finitely many states, the solution we introduced in the previous lecture solves for this fixed point directly, which is exactly policy evaluation:

\[ \bv^\pi = (I - \gamma P^\pi)^{-1} \br^\pi. \]

However, this may not always be the best approach since there can be a large number of state and action pairs, making it difficult to accurately estimate the transition matrix \(P^\pi\) and the expected rewards \(\br^\pi\). In fact, recursively updating the value function using the Bellman operator will converge to the same result1. This can be seen by recursively applying the contraction property of the Bellman operator:

\[ \| (\cT^\pi)^n \cV_1 - (\cT^\pi)^n \cV_2 \|_\infty \leq \gamma^n \big\| \cV_1 - \cV_2 \big\|_\infty \rightarrow 0, \quad \text{as } n \rightarrow \infty. \]

In particular, taking \(\cV_2 = \cV^\pi\), so that \((\cT^\pi)^n \cV_2 = \cV^\pi\), shows that repeatedly applying the operator to any starting value \(\cV_1\) converges to \(\cV^\pi\).
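Here is a minimal sketch of this iterative policy evaluation on a small hypothetical two-state problem; the transition matrix P_pi and reward vector r_pi below are made up purely for illustration, and the iterative solution should agree with the analytic fixed point.

  # a hypothetical two-state policy evaluation example
  P_pi = matrix(c(0.9, 0.1,
                  0.3, 0.7), byrow=TRUE, nrow=2)
  r_pi = c(1, 2)
  gamma = 0.7

  # iterate the Bellman operator from an arbitrary starting value
  V = c(0, 0)
  for (k in 1:200)
    V = as.vector(r_pi + gamma * P_pi %*% V)

  V                                        # iterative solution
  solve(diag(2) - gamma * P_pi) %*% r_pi   # analytic solution; should agree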

Bellman Optimality Equation and Policy Optimization

So far we have not discussed how to obtain the optimal policy in the MDP setting. However, we made an earlier observation in the batch Q-learning lecture that if we roll out the optimal policy, the value function is maximized, i.e.,

\[ \cV^\ast(s) = \E_{\pi^\ast} \left[ R + \gamma \cV^\ast(S') | S = s \right]. \]

Note that the optimal policy \(\pi^\ast\) selects the action that maximizes the value function. This leads to the Bellman Optimality Operator:

\[ \cT^\ast \cV(s) = \max_a \left[ r(s, a) + \gamma \E_{s' \sim p(s' | s, a)} \left[ \cV(s') \right] \right]. \]

We can also show that this operator is a \(\gamma\)-contraction with respect to the supremum norm, using the fact that \(| \max_a f(a) - \max_a g(a) | \leq \max_a | f(a) - g(a) |\):

\[ \begin{aligned} \| \cT^\ast \cV_1 - \cT^\ast \cV_2 \|_\infty &= \sup_s \left| \max_a \Big\{ r(s, a) + \gamma \E_{s' \sim p(s' | s, a)} \left[ \cV_1(s') \right] \Big\} - \max_a \Big\{ r(s, a) + \gamma \E_{s' \sim p(s' | s, a)} \left[ \cV_2(s') \right] \Big\} \right| \\ &\leq \gamma \sup_{s, a} \left| \E_{s' \sim p(s' | s, a)} \Big[ \cV_1(s') - \cV_2(s') \Big] \right| \\ &\leq \gamma \sup_{s, a} \int_{s'} \Big| \cV_1(s') - \cV_2(s') \Big| \, p(s' | s, a) \, ds' \\ &\leq \gamma \big\| \cV_1 - \cV_2 \big\|_\infty. \end{aligned} \]

In this case, we cannot analytically solve the Bellman Optimality Equation, but we can still iteratively update the value function until convergence. Following this idea, we next perform policy learning, assuming that all the transition probabilities and expected rewards are known.
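For a finite state and action space, one step of the Bellman Optimality Operator can be sketched as the function below; the list of per-action transition matrices P_list and the reward matrix r are placeholders for the quantities defined in the example that follows.

  # one application of the Bellman optimality operator for finite states and actions
  # V:      current value function (one entry per state)
  # r:      expected reward matrix, rows = states, columns = actions
  # P_list: list of transition matrices, one per action
  # gamma:  discount factor
  bellman_optimality <- function(V, r, P_list, gamma)
  {
    # Q function: one column per action
    Q = sapply(seq_along(P_list), function(a) r[, a] + gamma * P_list[[a]] %*% V)
    
    # greedy value over actions, for each state
    apply(Q, 1, max)
  }

The value iteration example below unrolls this same update explicitly.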

Example: Value Iteration

This understanding of the Bellman Optimality Equation leads immediately to the Value Iteration algorithm. The idea is to start from some arbitrary value function and iteratively update it using the Bellman Optimality Operator until convergence. Note that the formulation requires knowing the transition probabilities and expected rewards; although these could be estimated by sample averages, we will simply assume they are known. Furthermore, applying the Bellman Optimality Operator requires maximizing over all possible actions, which amounts to calculating the Q function and picking the largest entry. We design our example as follows:

  • State 1 gives the highest reward, and picking action 1 will most likely keep us in the same state.
  • State 2 gives lower rewards; one may pick action 2 (higher immediate reward) to stay in state 2, or action 1 to transit to state 3.
  • State 3 gives a lower reward than state 1, but action 1 will most likely transit to state 1, which yields larger rewards in the long run.
  • Overall, the optimal policy should be to always pick action 1: it most likely keeps us in state 1, and if we end up in state 2, action 1 moves us to state 3, from which action 1 brings us back to state 1.

The transition matrices under actions 1 and 2 are

\[ P_1 = \begin{pmatrix} 0.8 & 0.1 & 0.1 \\ 0.05 & 0.05 & 0.9 \\ 0.8 & 0.1 & 0.1 \end{pmatrix} \]

\[ P_2 = \begin{pmatrix} 0.5 & 0.25 & 0.25 \\ 0.1 & 0.8 & 0.1 \\ 0.2 & 0.2 & 0.6 \end{pmatrix} \]

Our immediate reward function is

\[ r = \begin{pmatrix} 5 & 3 \\ 1.6 & 3 \\ 4 & 2 \end{pmatrix} \]

We can see that picking action 1 is immediately better for states 1 and 3, but not for state 2. However, based on our transition kernel design, picking action 1 in state 2 allows us to transit to state 3 and then back to state 1 faster, eventually attaining a higher cumulative reward.

  # Define the transition probability matrix and rewards 

  P1 = matrix(c(0.8, 0.1, 0.1,
                0.05, 0.05, 0.9,
                0.8, 0.1, 0.1), byrow=TRUE, nrow=3)

  P2 = matrix(c(0.5, 0.25, 0.25,
                0.1, 0.8, 0.1,
                0.2, 0.2, 0.6), byrow=TRUE, nrow=3)

  # define the expected reward for each state and each action
  r = matrix(c(5, 3,
               1.6, 3,
               4, 2), byrow=TRUE, nrow=3)
  
  gamma = 0.7
  
  # start with some arbitrary value function
  V = c(0, 0, 0)

  # recursively update the value function using the Bellman Optimality operator
  # and record the greedy (argmax) policy at each iteration
  niter = 100
  for (k in 1:niter)
  {
    # Calculate the Q function, for each state and each action
    Q = matrix(NA, nrow=3, ncol=2)
    
    Q[, 1] = r[, 1] + gamma * P1 %*% V
    Q[, 2] = r[, 2] + gamma * P2 %*% V
    
    # update the value function using the bellman optimality operator
    V = apply(Q, 1, max)
    
    SummaryMat = cbind("Policy" = apply(Q, 1, which.max),
                       "Value" = V)
    
    if (k <= 6 | k == niter)
    {
      if (k == niter) cat("\n--------------\n\n")
      cat(paste("Iteration", k, "\n"))
      print(SummaryMat)
    }
  }
## Iteration 1 
##      Policy Value
## [1,]      1     5
## [2,]      2     3
## [3,]      1     4
## Iteration 2 
##      Policy Value
## [1,]      1  8.29
## [2,]      2  5.31
## [3,]      1  7.29
## Iteration 3 
##      Policy   Value
## [1,]      1 10.5244
## [2,]      2  7.0642
## [3,]      1  9.5244
## Iteration 4 
##      Policy     Value
## [1,]      1 12.054866
## [2,]      2  8.359368
## [3,]      1 11.054866
## Iteration 5 
##      Policy     Value
## [1,]      1 13.109721
## [2,]      2  9.298927
## [3,]      1 12.109721
## Iteration 6 
##      Policy    Value
## [1,]      1 13.84005
## [2,]      1 10.01343
## [3,]      1 12.84005
## 
## --------------
## 
## Iteration 100 
##      Policy    Value
## [1,]      1 15.54058
## [2,]      1 11.71449
## [3,]      1 14.54058

In fact, after just a few iterations, we have already identified the optimal policy, although it takes longer for the value function itself to converge. We can validate the policy value using Monte Carlo simulation:

  # write a function to simulate the MDP
  simulate_MDP <- function(n, policy, P1, P2, r, gamma, s0){
    # prepare the data
    S = rep(NA, n)
    A = rep(NA, n)
    R = rep(NA, n)
    
    # start the chain from the given initial state s0
    S[1] = s0
    
    # sequential simulation
    for (t in 1:n){
      
      # take an action based on the policy A_t
      A[t] = sample(1:2, 1, prob = policy[S[t],])
      
      # get the reward r(S_t, A_t) with a bit of noise
      R[t] = r[S[t], A[t]] + rnorm(1)*0.1
      
      if (t < n){
        # get the next state depending on the action
        if (A[t] == 1){
          S[t+1] = sample(1:3, 1, prob = P1[S[t],])
        } else {
          S[t+1] = sample(1:3, 1, prob = P2[S[t],])
        }
      }
    }
  
    #record the cumulative discounted reward
    return(list(S = S, A = A, R = R, cum_reward = sum(gamma^seq(0, n-1) %*% R)))
  }

  # Simulate the data
  # generate nsim = 1000 simulated trajectories from each starting state
  nsim = 1000
  cum_reward = matrix(NA, nsim, 3)
  set.seed(1)
  
  # the best policy is to pick action 1 all the time
  mypolicy = matrix(c(1, 0,
                      1, 0,
                      1, 0), byrow=TRUE, nrow=3)
  
  nchain = 100
  
  for (k in 1:nsim)
  {
    #record the cumulative discounted reward
    cum_reward[k, 1] = simulate_MDP(nchain, mypolicy, P1, P2, r, gamma, 1)$cum_reward
    cum_reward[k, 2] = simulate_MDP(nchain, mypolicy, P1, P2, r, gamma, 2)$cum_reward
    cum_reward[k, 3] = simulate_MDP(nchain, mypolicy, P1, P2, r, gamma, 3)$cum_reward
  }
  
  # simulated mean value function
  colMeans(cum_reward)
## [1] 15.47050 11.72851 14.51275
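Since the learned policy always picks action 1, its value can also be obtained in closed form from the fixed-point solution given earlier, \(\bv^\pi = (I - \gamma P^\pi)^{-1} \br^\pi\), with \(P^\pi = P_1\) and \(\br^\pi\) equal to the first column of r. The result should closely match both the value iteration output and the Monte Carlo estimates above.

  # closed-form value of the "always pick action 1" policy
  solve(diag(3) - gamma * P1) %*% r[, 1]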

Reference


  1. We can iteratively apply the operator to any arbitrary starting value \(\cV = x\), i.e., \((\cT^\pi)^n x\). This can be re-organized into \((\bI - (\gamma P^\pi)^n ) (\bI - \gamma P^\pi)^{-1} \br^\pi + (\gamma P^\pi)^n x\), and since \((\gamma P^\pi)^n \rightarrow 0\), we obtain the same solution.↩︎