\(\newcommand{\ci}{\perp\!\!\!\perp}\) \(\newcommand{\cA}{\mathcal{A}}\) \(\newcommand{\cB}{\mathcal{B}}\) \(\newcommand{\cC}{\mathcal{C}}\) \(\newcommand{\cD}{\mathcal{D}}\) \(\newcommand{\cE}{\mathcal{E}}\) \(\newcommand{\cF}{\mathcal{F}}\) \(\newcommand{\cG}{\mathcal{G}}\) \(\newcommand{\cH}{\mathcal{H}}\) \(\newcommand{\cI}{\mathcal{I}}\) \(\newcommand{\cJ}{\mathcal{J}}\) \(\newcommand{\cK}{\mathcal{K}}\) \(\newcommand{\cL}{\mathcal{L}}\) \(\newcommand{\cM}{\mathcal{M}}\) \(\newcommand{\cN}{\mathcal{N}}\) \(\newcommand{\cO}{\mathcal{O}}\) \(\newcommand{\cP}{\mathcal{P}}\) \(\newcommand{\cQ}{\mathcal{Q}}\) \(\newcommand{\cR}{\mathcal{R}}\) \(\newcommand{\cS}{\mathcal{S}}\) \(\newcommand{\cT}{\mathcal{T}}\) \(\newcommand{\cU}{\mathcal{U}}\) \(\newcommand{\cV}{\mathcal{V}}\) \(\newcommand{\cW}{\mathcal{W}}\) \(\newcommand{\cX}{\mathcal{X}}\) \(\newcommand{\cY}{\mathcal{Y}}\) \(\newcommand{\cZ}{\mathcal{Z}}\) \(\newcommand{\bA}{\mathbf{A}}\) \(\newcommand{\bB}{\mathbf{B}}\) \(\newcommand{\bC}{\mathbf{C}}\) \(\newcommand{\bD}{\mathbf{D}}\) \(\newcommand{\bE}{\mathbf{E}}\) \(\newcommand{\bF}{\mathbf{F}}\) \(\newcommand{\bG}{\mathbf{G}}\) \(\newcommand{\bH}{\mathbf{H}}\) \(\newcommand{\bI}{\mathbf{I}}\) \(\newcommand{\bJ}{\mathbf{J}}\) \(\newcommand{\bK}{\mathbf{K}}\) \(\newcommand{\bL}{\mathbf{L}}\) \(\newcommand{\bM}{\mathbf{M}}\) \(\newcommand{\bN}{\mathbf{N}}\) \(\newcommand{\bO}{\mathbf{O}}\) \(\newcommand{\bP}{\mathbf{P}}\) \(\newcommand{\bQ}{\mathbf{Q}}\) \(\newcommand{\bR}{\mathbf{R}}\) \(\newcommand{\bS}{\mathbf{S}}\) \(\newcommand{\bT}{\mathbf{T}}\) \(\newcommand{\bU}{\mathbf{U}}\) \(\newcommand{\bV}{\mathbf{V}}\) \(\newcommand{\bW}{\mathbf{W}}\) \(\newcommand{\bX}{\mathbf{X}}\) \(\newcommand{\bY}{\mathbf{Y}}\) \(\newcommand{\bZ}{\mathbf{Z}}\) \(\newcommand{\ba}{\mathbf{a}}\) \(\newcommand{\bb}{\mathbf{b}}\) \(\newcommand{\bc}{\mathbf{c}}\) \(\newcommand{\bd}{\mathbf{d}}\) \(\newcommand{\be}{\mathbf{e}}\) \(\newcommand{\bg}{\mathbf{g}}\) \(\newcommand{\bh}{\mathbf{h}}\) \(\newcommand{\bi}{\mathbf{i}}\) \(\newcommand{\bj}{\mathbf{j}}\) \(\newcommand{\bk}{\mathbf{k}}\) \(\newcommand{\bl}{\mathbf{l}}\) \(\newcommand{\bm}{\mathbf{m}}\) \(\newcommand{\bn}{\mathbf{n}}\) \(\newcommand{\bo}{\mathbf{o}}\) \(\newcommand{\bp}{\mathbf{p}}\) \(\newcommand{\bq}{\mathbf{q}}\) \(\newcommand{\br}{\mathbf{r}}\) \(\newcommand{\bs}{\mathbf{s}}\) \(\newcommand{\bt}{\mathbf{t}}\) \(\newcommand{\bu}{\mathbf{u}}\) \(\newcommand{\bv}{\mathbf{v}}\) \(\newcommand{\bw}{\mathbf{w}}\) \(\newcommand{\bx}{\mathbf{x}}\) \(\newcommand{\by}{\mathbf{y}}\) \(\newcommand{\bz}{\mathbf{z}}\) \(\newcommand{\RR}{\mathbb{R}}\) \(\newcommand{\NN}{\mathbb{N}}\) \(\newcommand{\balpha}{\boldsymbol{\alpha}}\) \(\newcommand{\bbeta}{\boldsymbol{\beta}}\) \(\newcommand{\btheta}{\boldsymbol{\theta}}\) \(\newcommand{\hpi}{\widehat{\pi}}\) \(\newcommand{\bpi}{\boldsymbol{\pi}}\) \(\newcommand{\hbpi}{\widehat{\boldsymbol{\pi}}}\) \(\newcommand{\bxi}{\boldsymbol{\xi}}\) \(\newcommand{\bmu}{\boldsymbol{\mu}}\) \(\newcommand{\bepsilon}{\boldsymbol{\epsilon}}\) \(\newcommand{\bzero}{\mathbf{0}}\) \(\newcommand{\T}{\text{T}}\) \(\newcommand{\Trace}{\text{Trace}}\) \(\newcommand{\Cov}{\text{Cov}}\) \(\newcommand{\Var}{\text{Var}}\) \(\newcommand{\E}{\mathbb{E}}\) \(\newcommand{\Pr}{\text{Pr}}\) \(\newcommand{\pr}{\text{pr}}\) \(\newcommand{\pdf}{\text{pdf}}\) \(\newcommand{\P}{\text{P}}\) \(\newcommand{\p}{\text{p}}\) \(\newcommand{\One}{\mathbf{1}}\) \(\newcommand{\argmin}{\operatorname*{arg\,min}}\) 
\(\newcommand{\argmax}{\operatorname*{arg\,max}}\) \(\newcommand{\dtheta}{\frac{\partial}{\partial\theta} }\) \(\newcommand{\ptheta}{\nabla_\theta}\) \(\newcommand{\alert}[1]{\color{darkorange}{#1}}\) \(\newcommand{\alertr}[1]{\color{red}{#1}}\) \(\newcommand{\alertb}[1]{\color{blue}{#1}}\)

1 Overview

In our previous lectures, we developed the foundations of reinforcement learning through the Markov Decision Process framework, the Bellman equation, and the contraction properties that justify dynamic programming. We introduced several algorithms such as fitted Q-iteration and policy gradient. These two perspectives highlight different sides of the same problem: one learns (action) values and extracts a policy afterward, while the other updates the policy directly without solving a value equation.

In this lecture, we connect these perspectives by introducing policy iteration, a classical dynamic programming method that alternates between evaluating a policy and improving it. This approach provides a clean theoretical framework for understanding how value functions and policies interact, and why iterative improvement converges to an optimal policy. However, the classical greedy improvement step becomes problematic in continuous action spaces, where computing an exact maximizer of \(Q(s,a)\) is generally intractable and unstable under function approximation.

These limitations motivate the introduction of Soft Actor-Critic (SAC, Haarnoja et al. (2018)), which can be viewed as a soft, entropy-regularized variant of policy iteration. By adding a maximum-entropy term to the objective, the hard greedy update is replaced by a smooth Boltzmann-type improvement that is computationally tractable and naturally compatible with continuous actions. This soft formulation also enhances exploration and stabilizes learning. Moreover, because SAC is an off-policy method, it can leverage fixed datasets, similar to a fitted Q-iteration approach, making it suitable for offline RL settings.

2 Classical Policy Iteration

In its classical dynamic programming form, policy iteration alternates between two steps (at iteration \(k\)):

  1. Policy Evaluation: Given a fixed policy \(\pi_k\), compute its value function \(V^{\pi_k}\) (or equivalently \(Q^{\pi_k}\)) by solving the Bellman equation for that policy. In the tabular finite-MDP setting, this can be done exactly by solving a system of linear equations (see the sketch at the end of this section). In other settings, where an exact solution is not available, it can be done by iterating the Bellman operator until convergence.

  2. Policy Improvement: Using the evaluated value function, construct a new policy
    \[ \pi_{k+1}(s) \in \arg\max_a Q^{\pi_k}(s,a), \] which chooses, in each state, an action that appears best under the current estimate.

These two steps are repeated until the policy no longer changes. Note that policy iteration is different from value iteration and the fitted Q-iteration algorithm we discussed before.1

A key theoretical result, called the Policy Improvement Theorem, guarantees that each improvement step never decreases the value of the policy, and therefore the procedure converges in a finite number of iterations to an optimal policy.
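To make the evaluation step in the first item concrete, here is a minimal R sketch for the tabular case. The helper `eval_policy_exact` is introduced here purely for illustration (it is not part of the lecture code): it takes a list of transition matrices `P` (one per action), a state-by-action reward matrix `r`, a discount factor `gamma`, and a policy vector `policy`, and solves the linear system \(V^\pi = r_\pi + \gamma P_\pi V^\pi\) directly.

  # Exact policy evaluation for a tabular MDP (illustrative helper)
  # P:      list of S x S transition matrices, one per action
  # r:      S x A matrix of expected immediate rewards
  # policy: length-S vector giving the chosen action in each state
  eval_policy_exact = function(P, r, gamma, policy)
  {
    S = nrow(r)
    # transition matrix and reward vector induced by the policy
    P_pi = t(sapply(1:S, function(s) P[[policy[s]]][s, ]))
    r_pi = r[cbind(1:S, policy)]
    # solve (I - gamma * P_pi) V = r_pi
    solve(diag(S) - gamma * P_pi, r_pi)
  }

With the transition matrices and rewards defined in the next section, calling `eval_policy_exact(list(P1, P2), r, gamma, c(2, 2, 1))` would return the exact value function of that starting policy.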

3 Example

When evaluating the policy, we can again use the Bellman equation to iteratively update the Q-function until convergence. Here is the example we used in the previous lecture for value iteration. Let’s modify the algorithm to perform policy iteration. Recall that the strategy is to get back to state 1 as fast as possible. The transition probability matrices for action 1 and action 2 are given as follows:

\[ P_1 = \begin{pmatrix} 0.8 & 0.1 & 0.1 \\ 0.05 & 0.05 & 0.9 \\ 0.2 & 0.2 & 0.6 \end{pmatrix} \]

\[ P_2 = \begin{pmatrix} 0.5 & 0.25 & 0.25 \\ 0.1 & 0.8 & 0.1 \\ 0.8 & 0.1 & 0.1 \end{pmatrix} \]

Our immediate reward function is

\[ r = \begin{pmatrix} 5 & 3 \\ 2 & 2.5 \\ 3 & 2 \end{pmatrix} \]

  # Transition matrices and reward matrix as before
  P1 = matrix(c(0.8, 0.1, 0.1,
                0.05, 0.05, 0.9,
                0.2, 0.2, 0.6), byrow=TRUE, nrow=3)

  P2 = matrix(c(0.5, 0.25, 0.25,
                0.1, 0.8, 0.1,
                0.8, 0.1, 0.1), byrow=TRUE, nrow=3)

  # define the expected reward for each state and each action
  r = matrix(c(5, 3,
               2, 2.5,
               3, 2), byrow=TRUE, nrow=3)

  gamma = 0.7
  
  # start from some arbitrary policy and Q function
  use_policy = c(2, 2, 1)
  Q = matrix(1, nrow=3, ncol=2)
  
  niter = 2 # number of policy improvement iterations
  n_eval_iter = 30 # (inner loop) number of evaluation iterations
  
  # policy iteration
  
  for (k in 1:niter)
  {
    # We need to first find the policy value of the current policy
    # This involves a further iteration
    # recall our bellman equation for any policy
    for (j in 1:n_eval_iter)
    {
      # We will use the current estimate of Q as a starting point
      # and compute the corresponding state values under the current policy
      V = Q[cbind(1:3, use_policy)]
      
      # Use the Bellman Equation to update the Q function
      Q[, 1] = r[, 1] + gamma * P1 %*% V
      Q[, 2] = r[, 2] + gamma * P2 %*% V
    }
    
    # up to here, we have a good estimation of the Q function
    # We can then update the policy
    use_policy = apply(Q, 1, which.max)
    
    # print some results
    SummaryMat = cbind("Policy" = use_policy,
                       "Q Function" = Q)
    
    if (k <= 3 | k == niter)
    {
      if (k == niter) cat("\n--------------\n\n")
      cat(paste("Iteration", k, "\n"))
      print(SummaryMat)
    }
  }
## Iteration 1 
##      Policy                   
## [1,]      1 11.625997 9.543939
## [2,]      2  8.675954 8.724445
## [3,]      1  9.581929 8.625997
## 
## --------------
## 
## Iteration 2 
##      Policy                    
## [1,]      1 14.681706 11.788493
## [2,]      1  9.834572  9.779745
## [3,]      2 11.076879 11.681706

We can see that after just 2 iterations, the policy has already converged to the optimal policy. This is much faster than value iteration, at least in terms of the number of outer-loop policy changes. The downside is that policy iteration can be more computationally expensive, since each iteration requires evaluating the Q-function of the current policy (hopefully accurately). In some sense, we can view value iteration as a block-wise update in which we move the Q-function just one step towards a better estimate with the Bellman update, and then update the policy to be greedy w.r.t the updated Q-function. In policy iteration, by contrast, we fully optimize the Q-function given the current policy before updating the policy.
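To see the contrast concretely, here is a minimal value-iteration sketch on the same example, reusing `P1`, `P2`, `r`, and `gamma` from the code above. Each sweep applies the Bellman optimality update once, and we print the greedy policy at a few (arbitrarily chosen) sweeps to track how it evolves.

  # Value iteration on the same MDP, for comparison with policy iteration
  Q_vi = matrix(1, nrow=3, ncol=2)
  
  for (k in 1:30)
  {
    V = apply(Q_vi, 1, max)                 # state values implied by the current Q
    Q_vi[, 1] = r[, 1] + gamma * P1 %*% V   # one Bellman optimality sweep
    Q_vi[, 2] = r[, 2] + gamma * P2 %*% V
    
    if (k %in% c(1, 5, 30))
      cat("Sweep", k, "greedy policy:", apply(Q_vi, 1, which.max), "\n")
  }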

4 Policy Improvement Theorem

The policy iteration algorithm is guaranteed to converge to an optimal policy: the evaluation step computes the value of the current policy exactly, and the greedy improvement step never decreases that value, as formalized by the following result.

(Policy Improvement Theorem) Let \(\pi_0\) be some stationary policy. Let \(\pi\) be the policy obtained by any policy improvement step applied to \(\pi_0\), i.e., it is greedy w.r.t \(V^{\pi_0}\):

\[ \cT^\pi V^{\pi_0} = \cT^\ast V^{\pi_0} \]

Then \(\pi\) is an improvement upon \(\pi_0\), i.e.,

\[ V^\pi \geq V^{\pi_0} \]

The proof of this theorem is straightforward. Since we know that the policy improvement step is greedy, we have

\[ \cT^\pi V^{\pi_0} = \cT^\ast V^{\pi_0} \geq \cT^{\pi_0} V^{\pi_0} = V^{\pi_0}. \]

The first equality is due to the definition of the greedy policy: we are picking the best action w.r.t the value function of \(\pi_0\), so the Bellman operator corresponding to \(\pi\), applied to \(V^{\pi_0}\), coincides with the Bellman optimality operator applied to \(V^{\pi_0}\). The inequality is due to the fact that the Bellman optimality operator always gives a value no less than any other policy’s Bellman operator when applied to the same value function, simply because it picks the best action given the current value function \(V^{\pi_0}\). The last equality is due to the fact that \(V^{\pi_0}\) is the fixed point of the Bellman operator corresponding to policy \(\pi_0\).

Now if we further apply \(\cT^\pi\) on both sides, we have

\[ (\cT^\pi)^2 V^{\pi_0} \geq \cT^\pi V^{\pi_0} \geq V^{\pi_0}, \]

and repeatedly applying this leads to

\[ (\cT^\pi)^n V^{\pi_0} \geq V^{\pi_0}. \]

However, since we know from the properties of the Bellman operator that \(\cT^\pi\) is a contraction mapping, iteratively applying \(\cT^\pi\) to any value function converges to the policy value of \(\pi\). Hence, letting \(n \to \infty\), we have

\[ V^{\pi} = \lim_{n\to\infty} (\cT^\pi)^n V^{\pi_0} \geq V^{\pi_0}. \]
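As a quick numerical illustration of the theorem (reusing `P1`, `P2`, `r`, and `gamma` from Section 3 and the illustrative `eval_policy_exact` helper from Section 2), the sketch below evaluates an arbitrary starting policy exactly, performs one greedy improvement step, re-evaluates, and checks that the value did not decrease in any state.

  # Check the policy improvement theorem on the 3-state example
  P = list(P1, P2)
  
  pi0 = c(2, 2, 1)                          # an arbitrary starting policy
  V0 = eval_policy_exact(P, r, gamma, pi0)  # exact V^{pi0}
  
  # greedy improvement w.r.t. Q^{pi0}(s,a) = r(s,a) + gamma * sum_s' P_a(s,s') V^{pi0}(s')
  Q0 = cbind(r[, 1] + gamma * P1 %*% V0,
             r[, 2] + gamma * P2 %*% V0)
  pi1 = apply(Q0, 1, which.max)
  V1 = eval_policy_exact(P, r, gamma, pi1)  # exact V^{pi1}
  
  all(V1 >= V0 - 1e-10)                     # should be TRUE in every state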

5 On-Policy vs. Off-Policy

The classical policy iteration framework provides a principled way to improve a policy through repeated evaluation and greedy updates, but it implicitly assumes an on-policy setting, because each iteration requires an accurate evaluation of \(V^{\pi_k}\), which is typically done with online, on-policy data. For example, one can collect driving data by following the current driving policy \(\pi_k\) and use that data to estimate its value function. However, in other problems, such as a medical treatment recommendation system, it may be costly or even unethical to deploy a new policy for data collection at each iteration. We then face an off-policy evaluation problem, i.e., evaluating a policy \(\pi\) using data collected by another (behavior) policy \(\mu\). Soft Actor-Critic (SAC) belongs to this off-policy class. It can be viewed as a soft, entropy-regularized version of policy iteration that replaces the hard greedy update with a smooth improvement step and can use an offline dataset to perform off-policy evaluation and learning. The key advantage is that if the behavior policy that generated the data is reasonably close to the optimal policy, or at least covers the important state-action regions, then SAC can learn a good policy without requiring on-policy samples. This can be a reasonable assumption when we are trying to improve an existing treatment plan and only want to learn a new plan that slightly tweaks the existing one. This makes SAC particularly appealing in some offline RL settings.

6 Maximum Entropy Formulation

Before introducing Soft Actor-Critic, we first review the idea of using entropy in a reinforcement learning objective. In the standard framework, the goal is to maximize the expected cumulative reward. In the maximum entropy formulation, we instead consider an augmented objective that adds an entropy term to encourage the policy to remain stochastic. Formally, the objective

\[ J(\pi) = \mathbb{E}[V^{\pi}(s)] \]

is replaced with the discounted maximum-entropy return

\[ J(\pi) =\E_{\pi} \Big[ \sum_{t=0}^{\infty} \gamma^t \big( r(s_t,a_t) + \alpha\,\cH(\pi(\cdot|s_t)) \big) \Big], \]

where

\[ \cH(\pi(\cdot|s)) = -\sum_{a}\pi(a|s)\log\pi(a|s). \]
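As a small sanity check of this definition, a couple of R lines confirm that the entropy is largest for a uniform policy and shrinks as the policy becomes more deterministic.

  # entropy of a discrete policy distribution pi(.|s)
  entropy = function(p) -sum(p * log(p))
  
  entropy(c(0.5, 0.5))   # uniform over two actions: log(2), about 0.693
  entropy(c(0.9, 0.1))   # nearly deterministic: about 0.325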

Note that here the entropy term is not a function of the actually observed action \(a_t\), but rather of the policy distribution at state \(s_t\), as indicated by the \(\cdot\) notation. Writing the entropy itself as an expectation over actions, we have

\[ -\sum_a \pi(a|s)\log\pi(a|s) = -\E_{a\sim\pi(\cdot|s)} [ \log\pi(a|s) ]. \]

Substituting this into the maximum-entropy return, we can re-define the return as

\[ J(\pi) =\E_{\pi} \Big[ \sum_{t=0}^{\infty} \gamma^t \big( r(s_t,a_t) - \alpha \log\pi(a_t|s_t) \big) \Big], \]

where, inside the expectation \(\E_\pi\), the per-step term \(r(s_t,a_t) - \alpha\log\pi(a_t|s_t)\) plays the role of the entropy-augmented immediate reward \(r(s_t,a_t) + \alpha\,\cH(\pi(\cdot|s_t))\), since the two agree once we average over \(a_t \sim \pi(\cdot|s_t)\).

Under this objective, the value function becomes a “soft” version of the value

\[ V^\pi(s) = \E_{a\sim\pi(\cdot|s)} \big[ Q^\pi(s,a) - \alpha \log \pi(a|s) \big], \]

and the corresponding soft Q-function is defined as

\[ Q^\pi(s,a) = r(s,a) + \gamma\,\E_{s'\sim P(\cdot|s,a)}\left[ V^\pi(s') \right]. \]

These definitions incorporate entropy directly into the Bellman recursion and thus reflect the discounted maximum-entropy objective above. In addition, the optimal soft value function satisfies the soft Bellman optimality equation:

\[ \begin{aligned} V^\ast(s) =& \max_{\pi(\cdot|s)} \; \E_{a\sim\pi(\cdot|s)} \big[ Q^\ast(s,a) - \alpha \log \pi(a|s) \big] \\ =& \max_{\pi(\cdot|s)} \;\E_{a\sim\pi(\cdot|s)} \big[ Q^\ast(s,a) \big] + \alpha\,\cH(\pi(\cdot|s)). \end{aligned} \]

A key consequence of this formulation is that the optimal policy at state \(s\) admits a closed-form solution of Boltzmann type, as we will derive next.

7 The Boltzmann Policy

We now derive the optimal policy at a particular state \(s\). Suppose we have access to the soft Q function \(Q(s,a)\), which already incorporates the discount factor through the soft Bellman equation. The goal is to choose a distribution \(\pi(\cdot|s)\) that maximizes the soft value function:

\[ \begin{aligned} \max_{\pi(\cdot|s)} & \quad \E_{a\sim\pi(\cdot|s)} \big[ Q(s,a) \big] \;+\; \alpha\,\cH(\pi(\cdot|s)) \\[2mm] \text{subject to} & \quad \sum_{a\in\cA}\pi(a|s)=1. \end{aligned} \]

To solve this optimization problem, we write the Lagrangian

\[ \cL(\pi,\lambda) = \sum_{a}\pi(a) Q(s,a) - \alpha\sum_{a}\pi(a)\log\pi(a) - \lambda\Big(\sum_{a}\pi(a)-1\Big), \]

where \(\lambda\) is the Lagrange multiplier that enforces the normalization constraint. Taking the derivative with respect to \(\pi(a)\) and setting it to zero yields

\[ Q(s,a) - \alpha(1+\log\pi(a)) - \lambda = 0. \]

Solving for \(\log\pi(a)\),

\[ \log\pi(a) = \frac{1}{\alpha} Q(s,a) - \Big(\frac{\lambda}{\alpha} + 1\Big). \]

Exponentiating both sides gives

\[ \pi(a) = \exp\!\Big(\frac{1}{\alpha}Q(s,a)\Big) \cdot \exp\!\Big(-\frac{\lambda}{\alpha}-1\Big). \]

The second factor is a constant that does not depend on \(a\) and serves as a normalization factor. Enforcing \(\sum_a \pi(a|s)=1\) gives

\[ \exp\!\Big(-\frac{\lambda}{\alpha}-1\Big) = \frac{1}{Z(s)}, \qquad Z(s) = \sum_{a\in\cA}\exp\!\Big(\frac{1}{\alpha}Q(s,a)\Big). \]

Thus, the optimal soft policy is

\[ \pi^{\ast}(a|s) = \frac{\exp\!\big(\frac{1}{\alpha} Q(s,a)\big)}{Z(s)}. \]

In practice we do not need to solve for the normalization constant \(Z(s)\) explicitly, since knowing the Q values up to an additive constant is sufficient for specifying the policy distribution.
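To make this concrete, here is a small R helper (a generic softmax sketch introduced here, not part of any package) that converts a vector of Q values at a fixed state into the Boltzmann policy, and confirms the remark above that shifting all Q values by a constant leaves the policy unchanged.

  # Boltzmann (softmax) policy from a vector of Q values at one state
  soft_policy = function(q, alpha)
  {
    z = q / alpha
    z = z - max(z)           # subtract the max for numerical stability
    exp(z) / sum(exp(z))     # normalize, i.e., divide by Z(s)
  }
  
  q = c(1.0, 2.0, 1.5)
  soft_policy(q, alpha = 0.5)
  soft_policy(q + 100, alpha = 0.5)   # identical: Q matters only up to an additive constant

As \(\alpha \to 0\) the resulting policy approaches the greedy policy, while a large \(\alpha\) pushes it toward the uniform distribution.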

8 Soft Actor-Critic Algorithm

The SAC algorithm can be viewed as a soft, entropy-regularized variant of policy iteration. In practice, we do not directly compute the exact soft policy from the Q function, because doing so would require evaluating the Q function over the entire action space for every new state \(s\) in order to compute the normalization constant \(Z(s)\) (an integral in the continuous-action case). When the Q function is complicated, such as a neural network, this is computationally expensive. Instead, we parameterize the policy as a parametric distribution class \(\Pi = \{\pi_\phi\}\) (e.g., Gaussian policies with neural-network mean and variance) and perform a projection step to find the policy in \(\Pi\) that is closest to the optimal soft policy derived above. In some sense, this solves for the best soft policy within the parameterized class of policies that we are considering.

The next question is how to measure what “close” means when the policy is parameterized. Here we use the KL divergence between our parameterized policy and the soft (Boltzmann) policy induced by the Q function derived above. Using this, SAC defines the improved policy as

\[ \pi_{\phi}^\text{new} = \arg\min_{\pi_\phi \in \Pi} \text{KL}\!\left( \pi_\phi(\cdot|s) \;\Vert\; \frac{ \exp(Q(s,\cdot)/\alpha) }{ Z(s) } \right). \]

Recall that the KL divergence between two distributions \(P\) and \(Q\) is defined as

\[ \begin{aligned} \text{KL}(P\|Q) &= \int P(x)\log\frac{P(x)}{Q(x)}\,dx \\ &= \E_{x\sim P} \big[\log P(x) - \log Q(x)\big]. \end{aligned} \]

Hence, for our case, the KL divergence gives

\[ \text{KL}(\pi_\phi \| \pi^{\ast}) = \E_{a\sim\pi_\phi} \bigg[ \log\pi_\phi(a|s) - \frac{1}{\alpha}Q(s,a) + \log Z(s) \bigg]. \]

Since \(\log Z(s)\) is constant with respect to \(\phi\), it does not affect the minimizer. Hence, the policy improvement step can be written as

\[ \pi_{\phi}^\text{new} = \arg\min_{\pi_\phi \in \Pi} \E_{a\sim\pi_\phi} \big[ \log\pi_\phi(a|s) - \frac{1}{\alpha}Q(s,a) \big]. \]

Since \(\pi_\phi\) is parameterized by \(\phi\), we can use a standard (stochastic) gradient descent approach to update the parameters \(\phi\) and minimize this objective. This gives us the policy improvement step in SAC. For the policy evaluation step, we update the soft Q-function using the soft Bellman equation. This step is essentially the same as the fitted Q-iteration we discussed before, except that it uses the soft Bellman equation. Suppose \(Q\) is parameterized by \(\theta\); then the loss function for updating \(\theta\) is

\[ \cL_Q(\theta) = \E_{(s,a,r,s')\sim\cD} \big[ \big( Q_\theta(s,a) - y \big)^2 \big], \]

where the target \(y\) is given by the soft Bellman equation

\[ y = r + \gamma\,\E_{a'\sim\pi_\phi(\cdot|s')} \big[ Q_{\bar\theta}(s',a') - \alpha \log \pi_\phi(a'|s') \big], \]

where \(Q_{\bar\theta}\) denotes a target network, i.e., a copy of \(Q_\theta\) whose parameters \(\bar\theta\) are updated slowly (for example, by an exponential moving average of \(\theta\)) to stabilize the regression target.

In summary, the SAC algorithm alternates between these two steps until convergence:

  1. Policy Evaluation: Update the soft Q-function parameters \(\theta\) by minimizing the loss \(\cL_Q(\theta)\) using the collected data.
  2. Policy Improvement: Update the policy parameters \(\phi\) by minimizing the KL divergence objective using the current soft Q-function (a small tabular illustration follows).
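For intuition, the sketch below runs a tabular version of these two alternating steps on the 3-state example from Section 3, reusing `P1`, `P2`, `r`, `gamma`, and the `soft_policy` helper from the previous section. With a small discrete action space the Boltzmann policy can be formed exactly, so no KL projection or parameterized policy is needed; this is only an illustration of soft policy iteration, not the neural-network SAC algorithm of Haarnoja et al. (2018).

  # Tabular soft policy iteration on the 3-state, 2-action example
  alpha = 0.3                             # entropy temperature (illustrative choice)
  Q_soft = matrix(0, nrow=3, ncol=2)
  pi_soft = matrix(0.5, nrow=3, ncol=2)   # start from the uniform policy
  
  for (k in 1:50)
  {
    # soft policy evaluation: iterate the soft Bellman equation for the current policy
    for (j in 1:30)
    {
      # soft value V(s) = E_{a ~ pi}[ Q(s,a) - alpha * log pi(a|s) ]
      V = rowSums(pi_soft * (Q_soft - alpha * log(pi_soft)))
      Q_soft[, 1] = r[, 1] + gamma * P1 %*% V
      Q_soft[, 2] = r[, 2] + gamma * P2 %*% V
    }
    
    # soft policy improvement: Boltzmann policy w.r.t. the current soft Q
    pi_soft = t(apply(Q_soft, 1, soft_policy, alpha = alpha))
  }
  
  round(pi_soft, 3)   # close to greedy for small alpha, closer to uniform for large alpha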

9 Relation to Behavior Data

In practice, SAC is an off-policy algorithm that can leverage a fixed dataset collected by some behavior policy \(\mu\). The key requirement is that the behavior policy should provide sufficient coverage of the state-action space relevant to the optimal policy. Then, the expected values in both the policy evaluation and improvement steps can be estimated using samples from this fixed dataset. This requirement is the same for other off-policy algorithms such as fitted Q-iteration and it allows SAC to learn an effective policy without requiring on-policy samples at each iteration, making it suitable for offline RL settings.

10 Other Regularization Forms

In practice, we may also be interested in regularizing the policy to stay close to the behavior policy, rather than (or in addition to) maximizing entropy. This is useful in settings where we want to ensure that the learned policy does not deviate too far from a known safe or baseline policy, such as in offline RL or safety-critical applications. In this case, the policy improvement step can be modified to include a KL divergence penalty that encourages the new policy to remain close to the prior/behavior policy (Wu, Tucker, and Nachum 2019; Fujimoto, Meger, and Precup 2019; Kumar et al. 2019).
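As a schematic illustration only (the cited papers differ in how the divergence is estimated and in whether it enters as a penalty or a constraint), such a behavior-regularized improvement step could take the form

\[ \pi_{\phi}^\text{new} = \arg\min_{\pi_\phi \in \Pi} \; \E_{a\sim\pi_\phi} \Big[ \log\pi_\phi(a|s) - \frac{1}{\alpha}Q(s,a) \Big] + \beta\,\text{KL}\big( \pi_\phi(\cdot|s) \,\Vert\, \mu(\cdot|s) \big), \]

where \(\beta \geq 0\) is a tuning parameter (introduced here for illustration) controlling how strongly the learned policy is pulled toward the behavior policy \(\mu\); setting \(\beta = 0\) recovers the SAC improvement step above.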


Fujimoto, Scott, David Meger, and Doina Precup. 2019. “Off-Policy Deep Reinforcement Learning Without Exploration.” In International Conference on Machine Learning, 2052–62. PMLR.
Haarnoja, Tuomas, Aurick Zhou, Pieter Abbeel, and Sergey Levine. 2018. “Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor.” In International Conference on Machine Learning, 1861–70. PMLR.
Kumar, Aviral, Justin Fu, Matthew Soh, George Tucker, and Sergey Levine. 2019. “Stabilizing Off-Policy q-Learning via Bootstrapping Error Reduction.” Advances in Neural Information Processing Systems 32.
Wu, Yifan, George Tucker, and Ofir Nachum. 2019. “Behavior Regularized Offline Reinforcement Learning.” arXiv Preprint arXiv:1911.11361.

  1. In value iteration, we update the value function directly using the Bellman optimality equation. However, the updated value does not correspond to any policy until the very end, when we extract the policy by taking the greedy action w.r.t the final value function. Hence, in some sense, the parameters of the value function and the policy are being updated alternately, but neither is fully optimized given the other at each step. In fitted Q-iteration, we update the Q-function directly using (a regression version of) the Bellman optimality equation, and similarly, the updated Q-function does not correspond to any policy until the very end, when convergence is reached.↩︎