\(\newcommand{\ci}{\perp\!\!\!\perp}\) \(\newcommand{\cA}{\mathcal{A}}\) \(\newcommand{\cB}{\mathcal{B}}\) \(\newcommand{\cC}{\mathcal{C}}\) \(\newcommand{\cD}{\mathcal{D}}\) \(\newcommand{\cE}{\mathcal{E}}\) \(\newcommand{\cF}{\mathcal{F}}\) \(\newcommand{\cG}{\mathcal{G}}\) \(\newcommand{\cH}{\mathcal{H}}\) \(\newcommand{\cI}{\mathcal{I}}\) \(\newcommand{\cJ}{\mathcal{J}}\) \(\newcommand{\cK}{\mathcal{K}}\) \(\newcommand{\cL}{\mathcal{L}}\) \(\newcommand{\cM}{\mathcal{M}}\) \(\newcommand{\cN}{\mathcal{N}}\) \(\newcommand{\cO}{\mathcal{O}}\) \(\newcommand{\cP}{\mathcal{P}}\) \(\newcommand{\cQ}{\mathcal{Q}}\) \(\newcommand{\cR}{\mathcal{R}}\) \(\newcommand{\cS}{\mathcal{S}}\) \(\newcommand{\cT}{\mathcal{T}}\) \(\newcommand{\cU}{\mathcal{U}}\) \(\newcommand{\cV}{\mathcal{V}}\) \(\newcommand{\cW}{\mathcal{W}}\) \(\newcommand{\cX}{\mathcal{X}}\) \(\newcommand{\cY}{\mathcal{Y}}\) \(\newcommand{\cZ}{\mathcal{Z}}\) \(\newcommand{\bA}{\mathbf{A}}\) \(\newcommand{\bB}{\mathbf{B}}\) \(\newcommand{\bC}{\mathbf{C}}\) \(\newcommand{\bD}{\mathbf{D}}\) \(\newcommand{\bE}{\mathbf{E}}\) \(\newcommand{\bF}{\mathbf{F}}\) \(\newcommand{\bG}{\mathbf{G}}\) \(\newcommand{\bH}{\mathbf{H}}\) \(\newcommand{\bI}{\mathbf{I}}\) \(\newcommand{\bJ}{\mathbf{J}}\) \(\newcommand{\bK}{\mathbf{K}}\) \(\newcommand{\bL}{\mathbf{L}}\) \(\newcommand{\bM}{\mathbf{M}}\) \(\newcommand{\bN}{\mathbf{N}}\) \(\newcommand{\bO}{\mathbf{O}}\) \(\newcommand{\bP}{\mathbf{P}}\) \(\newcommand{\bQ}{\mathbf{Q}}\) \(\newcommand{\bR}{\mathbf{R}}\) \(\newcommand{\bS}{\mathbf{S}}\) \(\newcommand{\bT}{\mathbf{T}}\) \(\newcommand{\bU}{\mathbf{U}}\) \(\newcommand{\bV}{\mathbf{V}}\) \(\newcommand{\bW}{\mathbf{W}}\) \(\newcommand{\bX}{\mathbf{X}}\) \(\newcommand{\bY}{\mathbf{Y}}\) \(\newcommand{\bZ}{\mathbf{Z}}\) \(\newcommand{\ba}{\mathbf{a}}\) \(\newcommand{\bb}{\mathbf{b}}\) \(\newcommand{\bc}{\mathbf{c}}\) \(\newcommand{\bd}{\mathbf{d}}\) \(\newcommand{\be}{\mathbf{e}}\) \(\newcommand{\bg}{\mathbf{g}}\) \(\newcommand{\bh}{\mathbf{h}}\) \(\newcommand{\bi}{\mathbf{i}}\) \(\newcommand{\bj}{\mathbf{j}}\) \(\newcommand{\bk}{\mathbf{k}}\) \(\newcommand{\bl}{\mathbf{l}}\) \(\newcommand{\bm}{\mathbf{m}}\) \(\newcommand{\bn}{\mathbf{n}}\) \(\newcommand{\bo}{\mathbf{o}}\) \(\newcommand{\bp}{\mathbf{p}}\) \(\newcommand{\bq}{\mathbf{q}}\) \(\newcommand{\br}{\mathbf{r}}\) \(\newcommand{\bs}{\mathbf{s}}\) \(\newcommand{\bt}{\mathbf{t}}\) \(\newcommand{\bu}{\mathbf{u}}\) \(\newcommand{\bv}{\mathbf{v}}\) \(\newcommand{\bw}{\mathbf{w}}\) \(\newcommand{\bx}{\mathbf{x}}\) \(\newcommand{\by}{\mathbf{y}}\) \(\newcommand{\bz}{\mathbf{z}}\) \(\newcommand{\balpha}{\boldsymbol{\alpha}}\) \(\newcommand{\bbeta}{\boldsymbol{\beta}}\) \(\newcommand{\btheta}{\boldsymbol{\theta}}\) \(\newcommand{\hpi}{\widehat{\pi}}\) \(\newcommand{\bpi}{\boldsymbol{\pi}}\) \(\newcommand{\hbpi}{\widehat{\boldsymbol{\pi}}}\) \(\newcommand{\bxi}{\boldsymbol{\xi}}\) \(\newcommand{\bmu}{\boldsymbol{\mu}}\) \(\newcommand{\bepsilon}{\boldsymbol{\epsilon}}\) \(\newcommand{\bzero}{\mathbf{0}}\) \(\newcommand{\T}{\text{T}}\) \(\newcommand{\Trace}{\text{Trace}}\) \(\newcommand{\Cov}{\text{Cov}}\) \(\newcommand{\Var}{\text{Var}}\) \(\newcommand{\E}{\mathbb{E}}\) \(\newcommand{\Pr}{\text{Pr}}\) \(\newcommand{\pr}{\text{pr}}\) \(\newcommand{\pdf}{\text{pdf}}\) \(\newcommand{\P}{\text{P}}\) \(\newcommand{\p}{\text{p}}\) \(\newcommand{\One}{\mathbf{1}}\) \(\newcommand{\argmin}{\operatorname*{arg\,min}}\) \(\newcommand{\argmax}{\operatorname*{arg\,max}}\)

Overview

There are many algorithms for policy learning in reinforcement learning. However, almost all of them rely on the Bellman equation.

Recap on Bellman Equation

We have seen the Bellman equation for both the value function and the Q-function; they are based on the same logic. For any policy \(\pi\), the Bellman equation is given by (assuming a finite state space and a stationary, possibly stochastic, policy)

\[ \begin{aligned} \cV^\pi(s) &= \E_{a \sim \pi(\cdot | s)} \big[ r(s, a) \big] + \gamma \E_{s'} \big[ \cV^\pi(s') \big] \\ &=\sum_a \pi(a | s) r(s, a) + \gamma \sum_a \sum_{s'} \pi(a | s) P(s' | s, a) \cV^\pi(s') \end{aligned} \]

The Bellman optimality equation is the version for the optimal policy

\[ \cV^\ast(s) = \max_a \bigg\{ r(s, a) + \gamma \sum_{s'} P(s' | s, a) \cV^\ast(s') \bigg\} \]

On the other hand, the Bellman equation for the Q-function is given by

\[ \begin{aligned} Q^\pi(s, a) &= r(s, a) + \gamma \sum_{s'} P(s' | s, a) \cV^\pi(s')\\ &= r(s, a) + \gamma \sum_{s'} P(s' | s, a) \sum_{a'} \pi(a' | s') Q^\pi(s', a') \end{aligned} \]

The corresponding Bellman optimality equation is

\[ Q^\ast(s, a) = r(s, a) + \gamma \sum_{s'} P(s' | s, a) \max_{a'} Q^\ast(s', a') \]
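The two optimality equations are connected in the usual way: the optimal value function is the largest Q-value at each state, and an optimal (deterministic) policy can be read off greedily from \(Q^\ast\):

\[ \cV^\ast(s) = \max_a Q^\ast(s, a), \qquad \pi^\ast(s) = \argmax_a \, Q^\ast(s, a). \]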

In all of the above equations, we view the reward function \(r(s, a)\) as the expected immediate reward given \(s\) and \(a\), and we assume the transition probability \(P(s' | s, a)\) is known.
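As a side note (this is a standard identity rather than something specific to this lecture), with a finite state space the evaluation equation can be stacked over all states and written in matrix form, which also yields a closed-form solution:

\[ \cV^\pi = \br^\pi + \gamma \bP^\pi \cV^\pi \quad \Longrightarrow \quad \cV^\pi = (\bI - \gamma \bP^\pi)^{-1} \br^\pi, \]

where \(\br^\pi(s) = \sum_a \pi(a | s) r(s, a)\) and \(\bP^\pi(s, s') = \sum_a \pi(a | s) P(s' | s, a)\). The inverse exists because \(\bP^\pi\) is a stochastic matrix and \(\gamma < 1\). This form will be handy when we discuss the policy evaluation step later.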

Policy Iteration

We have already seen an example of value iteration. Here, we discuss the policy iteration algorithm. Value iteration directly updates the estimate of the optimal value function based on the Bellman optimality equation. Policy iteration, on the other hand, consists of two steps and iteratively performs them until convergence:

  • Policy evaluation: calculate the Q-function of the current policy
  • Policy improvement: update (improve) the policy based on the estimated Q-function of the current policy.

When evaluating the policy, we can again use the Bellman equation to iteratively update the Q-function until convergence. Here is the example we used in the previous lecture for value iteration; let's modify the algorithm to perform policy iteration. The optimal policy is to pick action 1 regardless of the state. The transition matrices under actions 1 and 2 are, respectively,

\[ P_1 = \begin{pmatrix} 0.8 & 0.1 & 0.1 \\ 0.05 & 0.05 & 0.9 \\ 0.8 & 0.1 & 0.1 \end{pmatrix} \]

\[ P_2 = \begin{pmatrix} 0.5 & 0.25 & 0.25 \\ 0.1 & 0.8 & 0.1 \\ 0.2 & 0.2 & 0.6 \end{pmatrix} \]

Our immediate reward function, with rows indexing the states and columns indexing the two actions, is

\[ r = \begin{pmatrix} 5 & 3 \\ 1.6 & 3 \\ 4 & 2 \end{pmatrix} \]

We can see that picking action 1 yields a better immediate reward in states 1 and 3, but not in state 2. However, based on how the transition kernels are designed, picking action 1 in state 2 moves us to state 3 with high probability, and from there quickly back to state 1, which eventually attains a better long-run reward.

  # Define the transition probability matrix and rewards 

  P1 = matrix(c(0.8, 0.1, 0.1,
                0.05, 0.05, 0.9,
                0.8, 0.1, 0.1), byrow=TRUE, nrow=3)

  P2 = matrix(c(0.5, 0.25, 0.25,
                0.1, 0.8, 0.1,
                0.2, 0.2, 0.6), byrow=TRUE, nrow=3)

  # define the expected reward for each state and each action
  r = matrix(c(5, 3,
               1.6, 3,
               4, 2), byrow=TRUE, nrow=3)
  
  # start with some arbitrary value function
  V = c(0, 0, 0)
  
  gamma = 0.7
  
  # start from some arbitrary policy and Q function
  use_policy = c(2, 2, 2)
  Q = matrix(1, nrow=3, ncol=2)
  
  
  niter = 10 # number of policy improvement iterations
  n_eval_iter = 30 # number of evaluation iterations
  
  # policy iteration
  
  for (k in 1:niter)
  {
    # We need to first find the policy value of the current policy
    # This involves a further iteration
    # recall our bellman equation for any policy
    for (j in 1:n_eval_iter)
    {
      # We will use the current estimate of the Q-function as a starting point
      # and extract the corresponding value function
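      # i.e., V(s) = Q(s, pi(s)); cbind(1:3, use_policy) selects one entry per row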
      V = Q[cbind(1:3, use_policy)]
      
      # Use the Bellman Equation to update the Q function
      Q[, 1] = r[, 1] + gamma * P1 %*% V
      Q[, 2] = r[, 2] + gamma * P2 %*% V
    }
    
    # up to here, we have a good estimation of the Q function
    # We can then update the policy
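    # greedy step: pi(s) = argmax_a Q(s, a)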
    use_policy = apply(Q, 1, which.max)
    
    # print some results
    SummaryMat = cbind("Policy" = use_policy,
                       "Q Function" = Q)
    
    if (k <= 3 | k == niter)
    {
      if (k == niter) cat("\n--------------\n\n")
      cat(paste("Iteration", k, "\n"))
      print(SummaryMat)
    }
  }
## Iteration 1 
##      Policy                   
## [1,]      1 11.470239 9.353990
## [2,]      2  7.314622 9.581929
## [3,]      1 10.470239 8.018920
## Iteration 2 
##      Policy                  
## [1,]      1 15.51822 13.00146
## [2,]      1 11.69548 11.59666
## [3,]      1 14.51822 11.89371
## Iteration 3 
##      Policy                  
## [1,]      1 15.54058 13.03384
## [2,]      1 11.71449 11.66580
## [3,]      1 14.54058 11.92275
## 
## --------------
## 
## Iteration 10 
##      Policy                  
## [1,]      1 15.54058 13.03384
## [2,]      1 11.71449 11.66580
## [3,]      1 14.54058 11.92275

We can see that after just 2 iterations, the policy has already converged to the optimal policy. This is much faster than value iteration. The downside is that policy iteration is more computationally expensive, since each iteration requires evaluating the Q-function of the current policy (ideally to a good accuracy). In some sense, we can view policy iteration as a block coordinate descent algorithm: we alternate between updating the Q-function and updating the policy, and each one is updated to its optimal value given the other. Value iteration, on the other hand, is more like a coordinate descent algorithm, where we update the value function directly based on the Bellman optimality equation to obtain a better and better estimate of the optimal value function.
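Since the value function of a fixed policy satisfies a linear system, the evaluation step does not have to be iterative when the state space is small: using the matrix form noted earlier, we can solve \((\bI - \gamma \bP^\pi) \cV^\pi = \br^\pi\) directly. Below is a minimal sketch of policy iteration with exact evaluation, reusing P1, P2, r, and gamma defined above; the helper eval_policy_exact is our own construction for illustration, not part of the code above.

  # exact policy evaluation: solve (I - gamma * P_pi) V = r_pi
  # (eval_policy_exact is a helper defined here for illustration)
  eval_policy_exact = function(policy, P_list, r, gamma)
  {
    # transition matrix and reward vector induced by the policy
    P_pi = t(sapply(1:nrow(r), function(s) P_list[[policy[s]]][s, ]))
    r_pi = r[cbind(1:nrow(r), policy)]
    
    # closed-form solution of the Bellman evaluation equation
    solve(diag(nrow(r)) - gamma * P_pi, r_pi)
  }
  
  # policy iteration with exact evaluation
  P_list = list(P1, P2)
  use_policy = c(2, 2, 2)
  
  for (k in 1:5)
  {
    V = eval_policy_exact(use_policy, P_list, r, gamma)
    
    # greedy improvement based on the exact value of the current policy
    Q = cbind(r[, 1] + gamma * P1 %*% V,
              r[, 2] + gamma * P2 %*% V)
    use_policy = apply(Q, 1, which.max)
  }
  
  use_policy

Running this sketch should recover the same optimal policy (action 1 in every state) as the iterative version above.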

Convergence of Policy Iteration

The policy iteration algorithm is guaranteed to converge to the optimal policy. This is because the policy evaluation step computes the value of the current policy exactly, and the greedy policy improvement step never produces a worse policy. The following result, known as the Policy Improvement Theorem, makes this precise:

Theorem 1 (Policy Improvement Theorem) Let \(\pi_0\) be some stationary policy. Let \(\pi\) be the policy obtained by any policy improvement step applied to \(\pi_0\), i.e., it is greedy w.r.t. \(\cV^{\pi_0}\):

\[ \cT^\pi \cV^{\pi_0} = \cT^\ast \cV^{\pi_0} \]

Then \(\pi\) is an improvement upon \(\pi_0\), i.e.,

\[ \cV^\pi \geq \cV^{\pi_0} \]

\(\square\)
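Here \(\cT^\pi\) and \(\cT^\ast\) denote the Bellman evaluation and optimality operators; as a reminder, for any value function \(\cV\) they act pointwise as

\[ (\cT^\pi \cV)(s) = \sum_a \pi(a | s) \Big[ r(s, a) + \gamma \sum_{s'} P(s' | s, a) \cV(s') \Big], \qquad (\cT^\ast \cV)(s) = \max_a \Big[ r(s, a) + \gamma \sum_{s'} P(s' | s, a) \cV(s') \Big]. \]

Both operators are monotone (if \(\cV_1 \geq \cV_2\) elementwise, then applying either operator preserves the inequality) and are \(\gamma\)-contractions; these are the two properties used in the proof below.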

Proof (Policy Improvement Theorem). The proof of this theorem is straightforward. Since the policy improvement step is greedy, we have

\[ \cT^\pi \cV^{\pi_0} = \cT^\ast \cV^{\pi_0} \geq \cT^{\pi_0} \cV^{\pi_0} = \cV^{\pi_0}. \]

The inequality holds because \(\cT^\ast\) maximizes over actions, and the last equality is due to the fact that the policy evaluation step exactly calculates the value of \(\pi_0\), i.e., \(\cV^{\pi_0}\) is the fixed point of \(\cT^{\pi_0}\). If we further apply \(\cT^\pi\) to both sides and use the monotonicity of the Bellman operator, we have

\[ (\cT^\pi)^2 \cV^{\pi_0} \geq \cT^\pi \cV^{\pi_0} \geq \cV^{\pi_0}, \]

Repeating this argument, for any \(n \geq 1\) we have

\[ (\cT^\pi)^n \cV^{\pi_0} \geq \cV^{\pi_0}. \]

Moreover, since we know from the property of the Bellman operator that \(\cT^\pi\) is a contraction mapping, iteratively applying \(\cT^\pi\) to any value function converges to its unique fixed point, which is the value of \(\pi\). Hence, taking the limit as \(n \to \infty\), we have

\[ \lim_{n \to \infty} (\cT^\pi)^n \cV^{\pi_0} = \cV^{\pi}. \]

Hence, \(\cV^{\pi} \geq \cV^{\pi_0}\). \(\square\)
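As a quick sanity check (not part of the derivation above), we can verify the Policy Improvement Theorem numerically on the example MDP, reusing P1, P2, r, gamma, and the eval_policy_exact helper sketched earlier: evaluate an arbitrary starting policy exactly, perform one greedy improvement step, and confirm that the value function increases in every state.

  # start from the policy that always picks action 2
  pi0 = c(2, 2, 2)
  V0 = eval_policy_exact(pi0, list(P1, P2), r, gamma)
  
  # one greedy policy improvement step w.r.t. V0
  Q0 = cbind(r[, 1] + gamma * P1 %*% V0,
             r[, 2] + gamma * P2 %*% V0)
  pi1 = apply(Q0, 1, which.max)
  V1 = eval_policy_exact(pi1, list(P1, P2), r, gamma)
  
  # the theorem guarantees element-wise improvement
  all(V1 >= V0)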
