Recall that in the policy iteration algorithm, the general strategy is to iteratively perform two steps: evaluate the value of a policy and improve the policy. Still, the focus there is mainly on obtaining an accurate value (or action-value) function. In this section we explore another class of algorithms that focus on the policy itself, meaning that we directly parameterize the policy and try to optimize it. This will still require estimates of the value function, but they mainly serve as a tool to improve the policy. To give a general idea of policy-based methods, following Sutton and Barto (2018), we parameterize the policy as \(\pi(a | s; \theta)\), where \(\theta\) is the parameter of the policy. This induces a performance measure such as the value function, and we can improve the policy by iteratively updating the parameter \(\theta\) through
\[ \theta \leftarrow \theta + \alpha \nabla V(\theta) \]
where \(\alpha\) is the step size. In this section, we will introduce some of the most popular policy-based methods, including the REINFORCE algorithm, the actor-critic method, and the soft actor-critic method.
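To make this concrete, here is a minimal sketch (purely illustrative, not from Sutton and Barto) of one common parameterization: a tabular softmax policy \(\pi(a | s; \theta)\) over discrete states and actions, together with a generic gradient-ascent step. The problem sizes, step size, and function names below are hypothetical.

```python
import numpy as np

# Minimal sketch (illustrative only): a tabular softmax policy pi(a | s; theta),
# where theta is an |S| x |A| matrix of logits, one row per state.
n_states, n_actions = 4, 2              # hypothetical problem sizes
theta = np.zeros((n_states, n_actions))

def policy(theta, s):
    """Return pi(. | s; theta) as a probability vector (softmax of the logits)."""
    logits = theta[s] - theta[s].max()  # subtract the max for numerical stability
    p = np.exp(logits)
    return p / p.sum()

def gradient_ascent_step(theta, grad_V, alpha=0.1):
    """Generic update: theta <- theta + alpha * (estimated gradient of the value)."""
    return theta + alpha * grad_V

# With all-zero logits, the policy is uniform over the two actions.
print(policy(theta, s=0))               # [0.5 0.5]
```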
The policy gradient theorem is a fundamental result in policy-based methods. It provides a way to compute the gradient of the expected return with respect to the policy parameter. Let’s analyze the gradient of the value function with respect to the policy parameter.
\[ \begin{aligned} \ptheta V^\theta(s) &= \ptheta \E_{a \sim \pi} \left[ Q^\pi(s, a) \right] \\ &= \ptheta \int_a \pi(a | s; \theta) Q^\pi(s, a) da \\ &= \underbrace{\int_a \alertr{\ptheta \pi(a | s; \theta)} Q^\pi(s, a) da}_\text{I} + \underbrace{\int_a \pi(a | s; \theta) \alertr{\ptheta Q^\pi(s, a)} da}_\text{II} \qquad \text{(product rule)} \end{aligned} \tag{1} \]
The first term is easy to handle since we explicitly parameterize the policy. The second term is more challenging since we cannot directly take the derivative of the Q function, which itself depends on the policy. We are going to unroll the second term by repeatedly applying the Bellman equation and the product rule:
\[ \begin{aligned} &\quad\, \int_a \pi(a | s; \theta) \alertr{\ptheta} \Big[ r(s, a) + \alertr{ \gamma \E_{s'} \big[ V^\theta(s') \big] } \Big] da \\ &= \alertr{\gamma} \int_a \pi(a | s; \theta) \alertr{\int_{s'} \pr(s' | s, a) \ptheta V^\theta(s') ds'} da \\ &= \gamma \int_a \pi(a | s; \theta) \int_{s'} \pr(s' | s, a) \ptheta \left[ \int_{a'} \pi(a' | s'; \theta) Q^\pi(s', a') da' \right] ds' da \qquad \text{(definition of $V$)} \\ &= \gamma \int_a \pi(a | s; \theta) \int_{s'} \pr(s' | s, a) \int_{a'} \left[ \alertr{\ptheta \pi(a' | s'; \theta)} Q^\pi(s', a') + \pi(a' | s'; \theta) \alertr{\ptheta Q^\pi(s', a')} \right] da' \, ds' da \qquad \text{(product rule)} \\ &= \gamma \int_a \pi(a | s; \theta) \int_{s'} \pr(s' | s, a) \underbrace{\int_{a'} \alertr{\ptheta \pi(a' | s'; \theta)} Q^\pi(s', a') da'}_\text{same as Part (I), with $(s', a')$ in place of $(s, a)$} ds' da \\ &\quad\, + \gamma \int_a \pi(a | s; \theta) \int_{s'} \pr(s' | s, a) \underbrace{\int_{a'} \pi(a' | s'; \theta) \alertr{\ptheta Q^\pi(s', a')} da'}_\text{same as Part (II)} ds' da \\ &= \cdots \qquad \qquad \text{(recursively unroll $\ptheta Q^\pi$)}\\ &= \gamma \int_x \pr(\text{from $s$ to $x$ in 1 step}) \int_{a} \ptheta \pi(a | x; \theta) Q^\pi(x, a) da \, dx \\ &\quad\, + \gamma^2 \int_x \pr(\text{from $s$ to $x$ in 2 steps}) \int_{a} \ptheta \pi(a | x; \theta) Q^\pi(x, a) da \, dx \\ &\quad\, + \cdots \end{aligned} \]
Combining this with the first term, and noting that the first term corresponds to the marginal probability of transitioning to \(x\) in \(0\) steps, we can write the gradient of the value function as
\[ \ptheta V^\theta(s) = \int_x \sum_{k = 0}^\infty \gamma^k \, \pr(\text{from $s$ to }\alertr{x}\text{ in $k$ steps}) \bigg[ \int_a \ptheta \pi(a | \alertr{x}; \theta) Q^\pi(\alertr{x}, a) da \bigg] dx \]
Let’s define
\[ \eta^\pi(s) = \sum_{t=0}^\infty \gamma^t \pr(s_t = s | s_0) \]
and properly normalize it to be a probability distribution. This is the state visitation distribution:
\[ \mu^\pi(s) = \frac{\eta^\pi(s)}{\int_{s'} \eta^\pi(s') ds'} \]
Then we can write the gradient of the value function as
\[ \ptheta V^\theta(s_0) \propto \int_s \mu^\pi(s) \int_a \ptheta \pi(a | s; \theta) Q^\pi(s, a) \, da \, ds \tag{2} \]
It is important to note that \(\mu^\pi(s)\) is an on-policy distribution: it is the distribution of states visited when the agent follows the policy \(\pi\). Hence, we will not be able to sample from this distribution if we are following a different policy. Nonetheless, this result provides a way to estimate the gradient of the value function without worrying about the derivative of the Q function, and in an on-policy setting we can estimate the gradient directly by sampling from the policy. This is what the REINFORCE algorithm does.
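As a sanity check, the following self-contained sketch verifies the unnormalized identity displayed above (with \(\eta^\pi\) in place of \(\mu^\pi\)) numerically on a small, randomly generated MDP with a softmax policy, comparing the analytic expression against a finite-difference approximation of \(\ptheta V^\theta(s_0)\). The MDP and all constants are made up for illustration.

```python
import numpy as np

# Numerical check (illustrative) of the identity derived above:
#   d/dtheta V(s0) = int_x sum_k gamma^k Pr(s0 -> x in k steps) int_a Q(x, a) d/dtheta pi(a | x; theta) da dx
rng = np.random.default_rng(0)
nS, nA, gamma, s0 = 3, 2, 0.9, 0
P = rng.dirichlet(np.ones(nS), size=(nS, nA))   # P[s, a, s'] = Pr(s' | s, a)
r = rng.normal(size=(nS, nA))                   # reward r(s, a)
theta = rng.normal(size=(nS, nA))               # softmax policy logits

def value(theta):
    """Return (V, pi, P_pi) for the policy pi(a | s) = softmax(theta[s])."""
    e = np.exp(theta - theta.max(axis=1, keepdims=True))
    pi = e / e.sum(axis=1, keepdims=True)
    P_pi = np.einsum('sa,sat->st', pi, P)       # transition matrix under pi
    V = np.linalg.solve(np.eye(nS) - gamma * P_pi, (pi * r).sum(axis=1))
    return V, pi, P_pi

V, pi, P_pi = value(theta)
Q = r + gamma * P @ V                           # Q(s, a) = r(s, a) + gamma * E[V(s')]
eta = np.linalg.solve(np.eye(nS) - gamma * P_pi.T, np.eye(nS)[s0])  # eta(s), the discounted visitation

# Analytic gradient: for a softmax policy,
#   int_a Q(s, a) d pi(a | s) / d theta[s, b] da = pi(b | s) * (Q(s, b) - V(s)).
grad = eta[:, None] * pi * (Q - V[:, None])

# Finite-difference approximation of d V(s0) / d theta
fd, eps = np.zeros_like(theta), 1e-6
for s in range(nS):
    for a in range(nA):
        t2 = theta.copy()
        t2[s, a] += eps
        fd[s, a] = (value(t2)[0][s0] - V[s0]) / eps

print(np.max(np.abs(grad - fd)))                # should be numerically close to zero
```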
Proceeding one more step from Equation (2), we can write the gradient of the value function as
\[ \ptheta V^\theta(s_0) \propto \E_\pi \left[ \int_a Q^\pi(S_t, a) \ptheta \pi(a | S_t; \theta) \, da \right] \]
which leads to the following update when observing one state \(S_t\):
\[ \theta_\text{new} \leftarrow \theta_\text{old} + \alpha \int_a \widehat{Q}^\pi(S_t, a) \ptheta \pi(a | S_t; \theta) da \]
where \(\widehat{Q}^\pi(s, a)\) is the current estimate of the Q function. This is called the all-action version of the policy gradient. However, the Q function still needs to be estimated, and performing an integration over \(a\) at every update is not convenient. Hence another version of this algorithm was proposed by Williams (1992), which is the classical REINFORCE algorithm. The key idea is the log-derivative trick \(\ptheta \pi = \pi \, \ptheta \log \pi\):
\[ \begin{aligned} &\quad\, \E_\pi \left[ \int_a Q^\pi(S_t, a) \ptheta \pi(a | S_t; \theta) \, da \right] \\ &= \E_\pi \left[ \int_a \pi(a | S_t; \theta) \, Q^\pi(S_t, a) \, \ptheta \log \pi(a | S_t; \theta) \, da \right] \\ &= \E_\pi \left[ Q^\pi(S_t, A_t) \, \ptheta \log \pi(A_t | S_t; \theta) \right] \end{aligned} \]
And we can update the parameters using one realization of the sample \((S_t, A_t)\)
\[ \theta_\text{new} \leftarrow \theta_\text{old} + \alpha \widehat{Q}^\pi(S_t, A_t) \ptheta \log \pi(A_t | S_t; \theta) \]
Hence, this is a stochastic gradient ascent algorithm: we generate episodes by following the current policy and apply this update at every time step. Note that there is a small difference of a factor \(\gamma^t\) if we want to use a single sequence of data to repeatedly update the policy, since the state \(S_t\) visited at time \(t\) is weighted by \(\gamma^t\) in \(\eta^\pi\).
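For concreteness, here is a minimal sketch of this Monte Carlo version of REINFORCE on a made-up toy MDP: the return \(G_t\) serves as the estimate \(\widehat{Q}^\pi(S_t, A_t)\), and the update includes the \(\gamma^t\) factor mentioned above. All constants are chosen only for illustration.

```python
import numpy as np

# Minimal REINFORCE sketch on a made-up toy MDP (illustrative only).
rng = np.random.default_rng(0)
nS, nA, gamma, alpha = 2, 2, 0.9, 0.05
P = rng.dirichlet(np.ones(nS), size=nS)         # Pr(s' | s); transitions do not depend on a here
r = np.array([[1.0, 0.0],                       # reward r(s, a): action 0 is better in state 0,
              [0.0, 1.0]])                      # action 1 is better in state 1
theta = np.zeros((nS, nA))                      # softmax policy logits

def pi(theta, s):
    e = np.exp(theta[s] - theta[s].max())
    return e / e.sum()

for episode in range(2000):
    # Generate one episode of fixed length under the current policy
    s, traj = 0, []
    for t in range(20):
        a = rng.choice(nA, p=pi(theta, s))
        traj.append((s, a, r[s, a]))
        s = rng.choice(nS, p=P[s])
    # Compute the returns G_t backwards and apply the REINFORCE update (with the gamma^t factor)
    G = 0.0
    for t, (s_t, a_t, r_t) in reversed(list(enumerate(traj))):
        G = r_t + gamma * G                     # G_t, a Monte Carlo estimate of Q(S_t, A_t)
        grad_log = -pi(theta, s_t)              # d/dtheta[s_t, .] log pi(a_t | s_t; theta)
        grad_log[a_t] += 1.0                    #   = 1{. = a_t} - pi(. | s_t)
        theta[s_t] += alpha * (gamma ** t) * G * grad_log

# Since transitions do not depend on the action, the better action in each state is the one
# with the higher immediate reward; the learned policy should lean toward it.
print(pi(theta, 0), pi(theta, 1))
```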
In fact, Equation (2) can be modified with an arbitrary baseline function \(b(s)\), which is a function of the state only. This is similar to the mean function adjustment in the outcome weighted learning methods: as long as \(b(s)\) does not depend on the action, the gradient is unchanged. The gradient of the value function can then be written as
\[ \ptheta V^\theta(s_0) \propto \int_s \mu^\pi(s) \int_a \ptheta \pi(a | s; \theta) \Big[ Q^\pi(s, a) - b(s) \Big] \, da \, ds \]
And the corresponding one-sample REINFORCE update would be
\[ \theta_\text{new} \leftarrow \theta_\text{old} + \alpha \Big[ Q(S_t, A_t) - b(S_t) \Big] \ptheta \log \pi(A_t | S_t; \theta) \]
Hence, one natural idea is to take \(b(s)\) to be the value function \(V^\pi(s)\). Furthermore, we can write out the Bellman equation for the \(Q(S_t, A_t)\) part and replace it with the one-step target \(R_t + \gamma V^\pi(S_{t+1})\), which leads to
\[ Q(S_t, A_t) - b(S_t) = R_t + \gamma V^\pi(S_{t+1}) - V^\pi(S_t) = \delta_t \]
which is exactly the TD error. Hence, we can use the TD error \(\delta_t\) we introduced previously. This requires us to parameterize the value function as well: the algorithm alternates between estimating the value function and updating the parameterized policy. This is the actor-critic method, where the value function serves as the critic and the policy as the actor.
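A sketch of the resulting one-step update is given below, with a tabular value function as the critic and a softmax policy as the actor; the toy MDP, step sizes, and episode length are again hypothetical.

```python
import numpy as np

# Sketch of one-step actor-critic (illustrative): a tabular critic V(s) = w[s] and a softmax
# actor pi(a | s; theta), both updated online using the TD error
#   delta_t = R_t + gamma * V(S_{t+1}) - V(S_t).
rng = np.random.default_rng(0)
nS, nA, gamma = 2, 2, 0.9
alpha_w, alpha_theta = 0.1, 0.05                # critic and actor step sizes (hypothetical)
P = rng.dirichlet(np.ones(nS), size=nS)         # Pr(s' | s); transitions do not depend on a here
r = np.array([[1.0, 0.0], [0.0, 1.0]])          # reward r(s, a)
w = np.zeros(nS)                                # critic parameters (tabular value function)
theta = np.zeros((nS, nA))                      # actor parameters (softmax logits)

def pi(theta, s):
    e = np.exp(theta[s] - theta[s].max())
    return e / e.sum()

for episode in range(2000):
    s, I = 0, 1.0                               # I carries the gamma^t factor within the episode
    for t in range(20):
        a = rng.choice(nA, p=pi(theta, s))
        s_next = rng.choice(nS, p=P[s])
        delta = r[s, a] + gamma * w[s_next] - w[s]   # TD error, playing the role of Q - b
        w[s] += alpha_w * delta                      # critic update (semi-gradient TD(0))
        grad_log = -pi(theta, s)                     # d/dtheta[s, .] log pi(a | s; theta)
        grad_log[a] += 1.0
        theta[s] += alpha_theta * I * delta * grad_log
        I *= gamma
        s = s_next

print(pi(theta, 0), pi(theta, 1))               # the actor should again lean toward the higher-reward action
```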
When fitting a reinforcement learning model, we often want to balance exploration and exploitation. We have seen the impact of this on various occasions, even in the context of causal inference. For example, the positivity assumption in estimating the average treatment effect requires that the behavior policy has a chance to explore all treatment options at any covariate value. In the context of reinforcement learning, we want to make sure that the agent explores all state values, so that we can observe their rewards. As we can see from the previous derivation, we essentially require that the marginal visitation distribution \(\mu^\pi(s)\) is non-zero for all states. And given that the policy is stochastic, we can further require that the marginal visitation density is nonzero for any state-action pair \((s, a)\).
This concern applies to both offline and online settings. In the online setting, we can directly control the policy and make sure that it is stochastic enough; this requires us to design the updating algorithm in a way that contains sufficient randomness. In the offline setting, the behavior policy may not contain enough randomness. This makes our estimation of the optimal policy risky, in the sense that we may conclude that a policy has high value only because a few samples happen to have very high rewards, or because there are almost no samples in some region and the function approximation extrapolates a very high value there.
Let’s first look at an approach that is more suitable for the online setting. We can add a term to the objective function that encourages the entropy of the policy to be large. This is called the entropy regularized actor-critic method, or the soft actor-critic, proposed by Haarnoja et al. (2018). The objective function is
\[ J(\theta) = \E_\pi \left[ \sum_{t=0}^T \gamma^t \Big[ R_t + \alpha \cH \big(\pi(\cdot | S_t) \big) \Big] \right] \]
where \(\alpha\) is an entropy regularization parameter (the temperature, not to be confused with the step size used earlier), and \(\cH\) is the entropy function. The entropy of the policy at state \(s_t\) is defined as
\[ \begin{aligned} \cH\big(\pi(\cdot | s_t)\big) &= - \E_{a \sim \pi} \left[ \log \pi(a | s_t) \right] \\ &= - \int_a \pi(a | s_t) \log \pi(a | s_t) \, da \end{aligned} \]
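As a quick numerical illustration of this definition (with made-up probabilities), the entropy is largest for the uniform policy and shrinks as the policy becomes more deterministic:

```python
import numpy as np

def entropy(p):
    """H(pi(. | s)) = - sum_a pi(a | s) * log pi(a | s) for a discrete policy."""
    p = np.asarray(p)
    return -np.sum(p * np.log(p))

print(entropy([0.5, 0.5]))      # uniform over two actions: log(2) ~ 0.693 (maximum entropy)
print(entropy([0.9, 0.1]))      # nearly deterministic: ~ 0.325 (lower entropy)
```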
Since our goal is to maximize the objective function \(J(\theta)\) by choosing the policy parameter \(\theta\), this involves both the discounted sum of rewards and the entropy of the policy. The formulation encourages the policy to be more stochastic and hence leads to more exploration. The difference between this and the standard \(\epsilon\)-greedy method is that the \(\epsilon\)-greedy method still learns a deterministic policy and only uses the stochastic policy for exploration. In contrast, the entropy regularized actor-critic method learns a stochastic policy. As a consequence, the entropy regularized actor-critic method changes the traditional objective function of the MDP. We can see that the state value under this objective becomes
\[ V(s_t) = \E_{a_t \sim \pi} \big[ Q(s_t, a_t) - \alpha \log \pi(a_t | s_t) \big] \]
Hence, we can define a new Bellman operator, which essentially extends the traditional Bellman operator.
\[ \begin{aligned} \cT Q(s_t, a_t) &= R_t + \gamma \E_{s_{t+1}} \big[ V(s_{t+1}) \big] \\ &= R_t + \gamma \E_{s_{t+1}} \big[ \E_{a} \big[ Q(s_{t+1}, a) - \alpha \log (\pi( a | s_{t+1})) \big] \big] \\ \end{aligned} \]
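To illustrate, the following sketch applies this operator repeatedly to a tabular Q function on a made-up MDP. Here the policy is taken to be the softmax of \(Q / \alpha\) (the entropy-regularized greedy policy), although any stochastic policy could be plugged in; all constants are for illustration only.

```python
import numpy as np

# Sketch of the entropy-regularized (soft) Bellman backup on a tabular Q (illustrative only):
#   (T Q)(s, a) = r(s, a) + gamma * sum_{s'} P(s' | s, a) * E_{a' ~ pi}[ Q(s', a') - alpha * log pi(a' | s') ]
rng = np.random.default_rng(0)
nS, nA, gamma, alpha = 3, 2, 0.9, 0.2           # alpha is the entropy temperature here
P = rng.dirichlet(np.ones(nS), size=(nS, nA))   # P[s, a, s'] = Pr(s' | s, a)
r = rng.normal(size=(nS, nA))                   # reward r(s, a)
Q = np.zeros((nS, nA))

def soft_policy(Q, alpha):
    """Policy pi(a | s) proportional to exp(Q(s, a) / alpha) (entropy-regularized greedy policy)."""
    e = np.exp((Q - Q.max(axis=1, keepdims=True)) / alpha)
    return e / e.sum(axis=1, keepdims=True)

def soft_bellman(Q, alpha):
    pi = soft_policy(Q, alpha)
    V = np.sum(pi * (Q - alpha * np.log(pi)), axis=1)   # soft state value E_a[Q - alpha * log pi]
    return r + gamma * np.einsum('sat,t->sa', P, V)

# Iterating the operator (a gamma-contraction) converges to the entropy-regularized Q function
for _ in range(200):
    Q = soft_bellman(Q, alpha)
print(Q)
```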
The interesting thing about this is that we do not always need to bind ourselves to the traditional Bellman equation when solving for the optimal policy. As long as we define our objective function, or the corresponding new Bellman operator, we can manipulate the reinforcement learning problem in various ways. An interesting example is the concept of pessimism (Buckman, Gelada, and Bellemare 2020), which aims to design a safer policy. This can be done by controlling the worst-case scenario of the reward function, or by regularizing the policy to be more conservative. For example, we can penalize the estimated policy towards the behavior policy (Xu et al. 2021), or towards a more sparse policy (Zhou, Zhu, and Qu 2024).