Previously, we discussed models that estimate the value or Q-functions and then derive the optimal policy by taking the argmax over the action space. An alternative approach is to directly parameterize the policy \(\pi\), and optimize the policy parameters to maximize the expected cumulative return. This is known as the policy-based method. In this lecture, we will derive the policy gradient theorem, which provides an explicit expression for the gradient of the expected return with respect to the policy parameters. This theorem forms the basis for many modern reinforcement learning algorithms that directly optimize policies.
Recall that a trajectory \(\bh_t\) contains all the states and actions up to time \(t\):
\[ \bh_t = (s_0, a_0, \dots, s_t, a_t, s_{t+1}). \]
Here we will ignore the rewards for simplicity (think of them as deterministic functions of the states and actions if you like). The density function of a trajectory \(\bh_t\) under a policy \(\pi\) can be written as
\[ p_\pi(\bh_t) = \mu_0(s_0) \prod_{i=0}^{t} \pi(a_i \mid s_i) P(s_{i+1} \mid s_i, a_i), \]
where \(\mu_0\) is the initial state distribution, \(P\) is the transition probability, and \(\pi\) is the policy used to generate the data. Our goal is to find a good policy \(\pi\) that gives us a high cumulative return.
In our previous lecture, we did not really specify any functional form of the policy \(\pi\). Instead, we estimated the Q-function, and used the argmax operation to derive the optimal policy. Here, we will directly parameterize the policy \(\pi\) as \(\pi_\theta(a \mid s)\), where \(\theta\) is the parameter vector. A common choice can be a softmax version:
\[ \pi_\theta(a \mid s) = \frac{\exp(\phi(s,a)^\top \theta)}{\sum_{a'} \exp(\phi(s,a')^\top \theta)}, \]
where \(\phi(s,a)\) is a feature vector for the state-action pair \((s,a)\). A simple example is logistic regression when the action space is binary. With this parameterization in mind, we can consider the objective function for policy optimization as
\[ \begin{align} J(\theta) =& \, \E_{\, \bh \sim p_{\pi_\theta} \,} \bigg[ \sum_{t=0}^\infty \gamma^t r(s_t, a_t) \bigg] \\ =& \, \E_{s_0 \sim \mu_0} \Big[ V^{\pi_\theta}(s_0) \Big] \end{align} \]
This is understood as the expected discounted reward when following the policy \(\pi_\theta\), with the initial state \(s_0\) also integrated out according to the initial state distribution \(\mu_0\), which represents the population we are interested in. Here the notation \(V^{\pi_\theta}(s)\) is the value function under policy \(\pi_\theta\). Our eventual goal is to find the optimal parameter \(\theta^*\) that maximizes \(J(\theta)\), but as an intermediate step, we may only be concerned with finding a new policy that improves upon the current one. In general, we treat this as an optimization problem that maximizes \(J(\theta)\) by updating \(\theta\) along the gradient direction. This involves computing the gradient \(\nabla_\theta J(\theta)\). This general approach is known as the policy gradient method.
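Before stating the main theorem, here is a minimal NumPy sketch of the softmax parameterization above; the feature matrix and dimensions are arbitrary placeholders, not part of the lecture. It also computes the score function \(\nabla_\theta \log \pi_\theta(a\mid s) = \phi(s,a) - \sum_{a'} \pi_\theta(a'\mid s)\,\phi(s,a')\), which appears repeatedly in what follows.

```python
import numpy as np

def softmax_policy(theta, phi_s):
    """Return pi_theta(.|s); phi_s is the (n_actions, d) matrix of features phi(s, a)."""
    logits = phi_s @ theta                 # phi(s, a)^T theta for every action a
    logits -= logits.max()                 # subtract max for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

def score(theta, phi_s, a):
    """Score function: grad_theta log pi_theta(a|s) = phi(s,a) - sum_a' pi(a'|s) phi(s,a')."""
    probs = softmax_policy(theta, phi_s)
    return phi_s[a] - probs @ phi_s

# Toy example with 3 actions and 4-dimensional features (arbitrary placeholders).
rng = np.random.default_rng(0)
phi_s = rng.normal(size=(3, 4))
theta = rng.normal(size=4)
print(softmax_policy(theta, phi_s))
print(score(theta, phi_s, a=1))
```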
The following theorem gives us an explicit expression for the policy gradient.
(Policy Gradient Theorem) Given the objective function \(J(\theta)\), the gradient can be expressed as \[ \nabla_\theta J(\theta) = \frac{1}{1-\gamma} \, \mathbb{E}_{(s,a)\sim d^{\pi_\theta}} \Big[ \nabla_\theta \log \pi_\theta(a\mid s)\, Q^{\pi_\theta}(s,a) \Big]. \] Here, \(d^{\pi_\theta}(s,a)\) is the normalized discounted state-action visitation distribution under policy \(\pi_\theta\), defined (in the tabular case for simplicity) as \[ d^{\pi_\theta}(s,a) = (1-\gamma) \sum_{t=0}^\infty \gamma^t \, \mathbb{P}_{\pi_\theta}(s_t = s, a_t = a). \]
To prove this theorem, we start with the value function for a given state \(s\). Following the definition of the value function, under policy \(\pi_\theta\), we have
\[ V^{\pi_\theta}(s) = \sum_{a} \pi_\theta(a\mid s)\,Q^{\pi_\theta}(s,a). \]
Noticing that both \(\pi_\theta\) and \(Q^{\pi_\theta}\) depend on \(\theta\), we will differentiate \(V^{\pi_\theta}(s)\) with respect to \(\theta\) using the product rule:
\[ \nabla_\theta V^{\pi_\theta}(s) = \sum_{a} \big[ \nabla_\theta \pi_\theta(a\mid s)\big]\,Q^{\pi_\theta}(s,a) + \sum_{a} \pi_\theta(a\mid s)\,\nabla_\theta Q^{\pi_\theta}(s,a). \tag{1} \]
We cannot take the derivative of \(Q^{\pi_\theta}(s,a)\) directly, since it depends on \(\theta\) through all future actions and obeys its own Bellman equation. Thus we will express the Q-function in its Bellman equation form:
\[ Q^{\pi_\theta}(s,a) = R(s,a) + \gamma\,\mathbb{E}_{s' \sim P(\cdot\mid s,a)}\big[ V^{\pi_\theta}(s') \big]. \]
Notice that \(R\) and \(P\) are fixed once the specific action \(a\) is given, and hence they do not depend on \(\theta\). We have
\[ \nabla_\theta Q^{\pi_\theta}(s,a) = \gamma\,\mathbb{E}_{s' \sim P(\cdot\mid s,a)}\big[ \nabla_\theta V^{\pi_\theta}(s') \big]. \]
Substitute into the previous expression for \(\nabla_\theta V^{\pi_\theta}(s)\):
\[ \begin{aligned} \nabla_\theta V^{\pi_\theta}(s) &= \sum_{a} \big[ \nabla_\theta \pi_\theta(a\mid s)\big]\,Q^{\pi_\theta}(s,a) \\ &\quad + \gamma \sum_{a} \pi_\theta(a\mid s) \, \mathbb{E}_{s' \sim P(\cdot\mid s,a)}\big[ \nabla_\theta V^{\pi_\theta}(s') \big]. \end{aligned} \]
Now we will utilize the log-derivative trick. Notice that
\[ \nabla_\theta \pi_\theta(a\mid s) = \pi_\theta(a\mid s)\,\nabla_\theta \log \pi_\theta(a\mid s). \]
This allows us to rewrite the first term in (1) as an expectation under the policy distribution \(\pi_\theta(\cdot\mid s)\):
\[ \begin{aligned} &\sum_{a} \big[ \nabla_\theta \pi_\theta(a\mid s)\big]\,Q^{\pi_\theta}(s,a) \\ =& \sum_{a} \pi_\theta(a\mid s)\,\nabla_\theta \log \pi_\theta(a\mid s)\,Q^{\pi_\theta}(s,a) \\ =& \, \mathbb{E}_{a \sim \pi_\theta(\cdot\mid s)}\big[ \nabla_\theta \log \pi_\theta(a\mid s)\,Q^{\pi_\theta}(s,a) \big]. \end{aligned} \]
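As a quick numerical sanity check of this identity, the sketch below compares the left-hand side, with \(\nabla_\theta \pi_\theta(a\mid s)\) approximated by finite differences, against the right-hand side computed with the analytic score of the softmax policy; the features and Q-values are arbitrary placeholders.

```python
import numpy as np

rng = np.random.default_rng(1)
phi_s = rng.normal(size=(3, 4))   # placeholder features phi(s, a) for 3 actions
theta = rng.normal(size=4)
Q = rng.normal(size=3)            # arbitrary Q(s, a) values for this check

def pi(theta):
    logits = phi_s @ theta
    p = np.exp(logits - logits.max())
    return p / p.sum()

# Left-hand side: sum_a [grad_theta pi(a|s)] Q(s,a), with the gradient of pi
# approximated by central finite differences.
eps, lhs = 1e-6, np.zeros(4)
for j in range(4):
    e = np.zeros(4); e[j] = eps
    lhs[j] = ((pi(theta + e) - pi(theta - e)) / (2 * eps)) @ Q

# Right-hand side: E_{a ~ pi}[ grad_theta log pi(a|s) Q(s,a) ] with the analytic score.
p = pi(theta)
scores = phi_s - p @ phi_s        # row a holds grad_theta log pi(a|s)
rhs = (p * Q) @ scores

print(np.allclose(lhs, rhs, atol=1e-5))   # True
```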
For the second term in (1), observe that \[ \gamma \sum_{a} \pi_\theta(a\mid s) \mathbb{E}_{s' \sim P(\cdot\mid s,a)}\big[ \nabla_\theta V^{\pi_\theta}(s') \big] = \gamma\,\mathbb{E}_{a \sim \pi_\theta(\cdot\mid s),\,s' \sim P(\cdot\mid s,a)} \big[ \nabla_\theta V^{\pi_\theta}(s') \big]. \]
Thus we obtain the recursive relation \[ \nabla_\theta V^{\pi_\theta}(s) = \mathbb{E}_{a \sim \pi_\theta(\cdot\mid s)}\big[ \nabla_\theta \log \pi_\theta(a\mid s)\,Q^{\pi_\theta}(s,a) \big] + \gamma\,\mathbb{E}_{a \sim \pi_\theta(\cdot\mid s),\,s' \sim P(\cdot\mid s,a)}\big[ \nabla_\theta V^{\pi_\theta}(s') \big]. \tag{2} \]
The next step is to further unroll this recursion along the trajectory induced by the policy \(\pi_\theta\). We consider a more compact form of the derivative of the value function:
\[ \begin{aligned} &\nabla_\theta V^{\pi_\theta}(s_0)\\ =& \, \E_{a_0 \sim \pi_\theta} \big[ \nabla_\theta \log \pi_\theta(a_0\mid s_0) \,Q^{\pi_\theta}(s_0,a_0) + \gamma\,\E_{s_1 \sim \mathbb{P}^{\pi_\theta}} \big[ \nabla_\theta V^{\pi_\theta}(s_1) \big] | s_0 \big] \\ =& \, \E_{a_0, s_1, a_1} \big[ \nabla_\theta \log \pi_\theta(a_0\mid s_0)\,Q^{\pi_\theta}(s_0,a_0) +\gamma\,\nabla_\theta \log \pi_\theta(a_1\mid s_1)\,Q^{\pi_\theta}(s_1,a_1) +\gamma^2\,\E_{s_2} \big[ \nabla_\theta V^{\pi_\theta}(s_2) \big] | s_0 \big] \\ =& \cdots \\ =& \,\E \Big[ \sum_{t=0}^{T} \gamma^t \nabla_\theta \log \pi_\theta(a_t\mid s_t)\,Q^{\pi_\theta}(s_t,a_t) +\gamma^{T+1} \E_{s_{T+1}} \nabla_\theta V^{\pi_\theta}(s_{T+1}) | s_0 \Big] \\ =& \cdots \\ =& \,\E \Big[ \sum_{t=0}^{\infty} \gamma^t \nabla_\theta \log \pi_\theta(a_t\mid s_t)\,Q^{\pi_\theta}(s_t,a_t) | s_0 \Big]. \end{aligned} \]
By taking expectation over the initial state distribution \(\mu_0\), we have
\[ \begin{aligned} \nabla_\theta J(\theta) &= \E_{s_0 \sim \mu_0} \big[ \nabla_\theta V^{\pi_\theta}(s_0) \big] \\ &= \E \Big[ \sum_{t=0}^{\infty} \gamma^t \nabla_\theta \log \pi_\theta(a_t\mid s_t)\,Q^{\pi_\theta}(s_t,a_t) \Big]. \end{aligned} \]
in which the expectation is taken over the trajectory, including the initial state distribution, the policy, and the state transition dynamics. This expression is still complicated, since it involves an expectation over the infinite horizon. However, there is a neat trick to rewrite it using the discounted state-action visitation distribution.
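Before introducing that trick, note that the trajectory form above already suggests a simple Monte Carlo estimator: sample trajectories under \(\pi_\theta\), replace \(Q^{\pi_\theta}(s_t,a_t)\) by the observed discounted return from time \(t\) onward, and average. The sketch below is a REINFORCE-style version of this idea; the environment interface (`env.reset`, `env.step`) and the `policy` (action sampler) and `score` (returning \(\nabla_\theta \log \pi_\theta(a\mid s)\)) functions are assumed placeholders.

```python
import numpy as np

def reinforce_gradient(env, policy, score, theta, gamma=0.99, n_episodes=10, horizon=200):
    """Monte Carlo estimate of grad J(theta) = E[ sum_t gamma^t grad log pi(a_t|s_t) Q(s_t,a_t) ],
    with Q(s_t, a_t) replaced by the observed return-to-go G_t = sum_{k>=t} gamma^(k-t) r_k."""
    grad = np.zeros_like(theta)
    for _ in range(n_episodes):
        s = env.reset()
        states, actions, rewards = [], [], []
        for _ in range(horizon):                  # truncate the infinite horizon
            a = policy(theta, s)                  # sample a ~ pi_theta(.|s)
            s_next, r, done = env.step(a)
            states.append(s); actions.append(a); rewards.append(r)
            s = s_next
            if done:
                break
        G, returns = 0.0, np.zeros(len(rewards))
        for t in reversed(range(len(rewards))):   # discounted return-to-go
            G = rewards[t] + gamma * G
            returns[t] = G
        for t, (s_t, a_t) in enumerate(zip(states, actions)):
            grad += (gamma ** t) * returns[t] * score(theta, s_t, a_t)
    return grad / n_episodes
```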
We now introduce the normalized discounted occupancy measure, which describes how frequently each state is visited under a policy when future visits are geometrically discounted. This distribution plays a central role in expressing expected returns, analyzing policy optimization, and understanding the long-run visitation pattern induced by a policy. We will motivate this with a tabular case, but the idea extends to more general settings.
Let the initial state distribution \(\mu_0\) be written as a row vector in \(\mathbb{R}^{1 \times |\mathcal{S}|}\). Under a policy \(\pi\), the state distribution evolves as \[ \mu_{t+1}^\pi = \mu_t^\pi P^\pi, \] where \(P^\pi\) is the transition matrix induced by \(\pi\), with each row summing to one. Repeating this transition \(t\) times yields \[ \mu_t^\pi = \mu_0 (P^\pi)^t. \]
To reflect the discounted nature of future rewards, we weight future state distributions by \(\gamma^t\). The (unnormalized) discounted occupancy measure is defined as \[ d_\gamma^\pi = \sum_{t=0}^\infty \gamma^t \mu_t^\pi, \qquad 0 < \gamma < 1. \]
This is not a probability distribution; its total mass is \(1/(1-\gamma)\). Substituting the expression for \(\mu_t^\pi\), \[ \begin{aligned} d_\gamma^\pi &= \sum_{t=0}^\infty \gamma^t \mu_0 (P^\pi)^t \\ &= \mu_0 \sum_{t=0}^\infty (\gamma P^\pi)^t \\ &= \mu_0 (I - \gamma P^\pi)^{-1}. \end{aligned} \]
Since \(\| \gamma P^\pi \|_\infty = \gamma < 1\), the matrix inverse is well-defined. Multiplying by \(1 - \gamma\) normalizes the measure and yields the discounted stationary visitation distribution: \[ d^\pi = (1-\gamma)\, d_\gamma^\pi = (1-\gamma)\, \mu_0 (I - \gamma P^\pi)^{-1}. \]
Although it may be difficult to estimate this quantity when the state space is large, we can establish a useful recursive relation for \(d^\pi\), which resembles the Bellman equation. Multiplying the closed form above by \((I - \gamma P^\pi)\) on the right gives \[ d^\pi = (1-\gamma)\, \mu_0 + \gamma\, d^\pi P^\pi. \]
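In the tabular case these quantities can be computed exactly. The following sketch, using a random MDP and policy as placeholders, builds \(P^\pi\), evaluates the closed form \(d^\pi = (1-\gamma)\,\mu_0 (I-\gamma P^\pi)^{-1}\), and verifies the recursion above.

```python
import numpy as np

rng = np.random.default_rng(2)
n_states, n_actions, gamma = 5, 3, 0.9

# Random tabular MDP and policy (placeholders, just for illustration).
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a, s']
pi = rng.dirichlet(np.ones(n_actions), size=n_states)             # pi[s, a]
mu0 = rng.dirichlet(np.ones(n_states))                            # initial distribution

# Transition matrix induced by pi: P_pi[s, s'] = sum_a pi(a|s) P(s'|s, a).
P_pi = np.einsum('sa,sat->st', pi, P)

# Closed form: d_pi = (1 - gamma) * mu0 (I - gamma P_pi)^{-1}, a row vector.
d_pi = (1 - gamma) * mu0 @ np.linalg.inv(np.eye(n_states) - gamma * P_pi)

print(np.isclose(d_pi.sum(), 1.0))                                 # normalized to 1
print(np.allclose(d_pi, (1 - gamma) * mu0 + gamma * d_pi @ P_pi))  # Bellman-like recursion
```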
For a continuous state space, the relation becomes \[ d^\pi(s) = (1-\gamma)\, \mu_0(s) + \gamma\, \int d^\pi(s') \sum_{a} \pi(a\mid s') P(s\mid s',a)\, ds'. \]
This structure mirrors the Bellman equation for value functions. One may also define the normalized discounted state-action occupancy: \[ d^\pi(s,a) = d^\pi(s)\,\pi(a\mid s), \]
which satisfies
\[ d^\pi(s,a) = (1-\gamma)\, \mu_0(s)\,\pi(a\mid s) + \gamma\, \int d^\pi(s',a')\, P(s\mid s',a')\,\pi(a\mid s)\, ds'\, da'. \]
In practice, we may not have access to the transition dynamics \(P\) or the full state space. However, many modern methods estimate \(d^\pi\) (or density ratios relative to another policy) using only sampled trajectories by solving the fixed-point relation above. See Nachum et al. (2019) for an example.
To proceed, we first rewrite the trajectory-level expectation explicitly. For any measurable function \(f(s_t,a_t)\), consider an expectation of the form \[ \E_{\pi_\theta}\Bigg[ \sum_{t=0}^{\infty} \gamma^t\, f(s_t,a_t) \Bigg], \] where the expectation is taken over full trajectories \[ \bh = (s_0,a_0,s_1,a_1,\dots), \qquad a_t \sim \pi_\theta(\cdot\mid s_t),\; s_{t+1} \sim P(\cdot\mid s_t,a_t). \] The policy gradient derived above is exactly of this form for a particular choice of \(f\).
Writing this expectation explicitly, \[ \E_{\pi_\theta}\Bigg[ \sum_{t=0}^{\infty} \gamma^t\, f(s_t,a_t) \Bigg] = \int_{\bh} \Bigg( \sum_{t=0}^{\infty} \gamma^t\, f(s_t,a_t) \Bigg) p_{\pi_\theta}(\bh)\, d \bh, \] where \(p_{\pi_\theta}(\bh)\) is the probability density of the trajectory under policy \(\pi_\theta\). We then take the sum outside the integral:
\[ \begin{aligned} =& \sum_{t=0}^{\infty} \gamma^t \int_{\bh} f(s_t,a_t)\, p_{\pi_\theta}(\bh)\, d\bh \\ =& \sum_{t=0}^{\infty} \gamma^t \int_{s_t, \, a_t} f(s_t,a_t)\, \mu_t^{\pi_\theta}(s_t,a_t)\, da_t\, ds_t \\ =& \sum_{t=0}^{\infty} \gamma^t \int_{s, \, a} f(s,a)\, \mu_t^{\pi_\theta}(s,a)\, da\, ds \\ =& \int_{s, \, a} f(s,a) \Bigg( \sum_{t=0}^{\infty} \gamma^t \mu_t^{\pi_\theta}(s,a) \Bigg) da\, ds \\ =& \frac{1}{1-\gamma} \int_{s, \, a} f(s,a)\, d^{\pi_\theta}(s,a)\, da\, ds \\ =& \frac{1}{1-\gamma} \E_{(s,\,a) \sim d^{\pi_\theta}} \big[ f(s,a) \big], \end{aligned} \] where \(\mu_t^{\pi_\theta}(s,a)\) denotes the marginal distribution of \((s_t,a_t)\) under \(\pi_\theta\), and the third equality simply renames the integration variables.
Applying this result to our policy gradient expression, we set \[ f(s_t,a_t) = \nabla_\theta \log \pi_\theta(a_t\mid s_t)\, Q^{\pi_\theta}(s_t,a_t). \] This gives us the final form of the policy gradient theorem:
\[ \nabla_\theta J(\theta) = \frac{1}{1-\gamma} \, \mathbb{E}_{(s,a)\sim d^{\pi_\theta}} \Big[ \nabla_\theta \log \pi_\theta(a\mid s)\, Q^{\pi_\theta}(s,a) \Big]. \]
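Although \(d^{\pi_\theta}\) is rarely available in closed form, one can sample from it by drawing a geometric time index \(T\) with \(\mathbb{P}(T=t) = (1-\gamma)\gamma^t\) and rolling out \(\pi_\theta\) for \(T\) steps. The sketch below illustrates this scheme; the `env` and `policy` interfaces are the same assumed placeholders as before.

```python
import numpy as np

def sample_from_d_pi(env, policy, theta, gamma=0.99, rng=None):
    """Draw one (s, a) pair distributed as d^{pi_theta}: the time index T is
    geometric with P(T = t) = (1 - gamma) * gamma**t, matching the occupancy weights."""
    rng = rng or np.random.default_rng()
    T = rng.geometric(1.0 - gamma) - 1   # numpy's geometric starts at 1; shift to {0, 1, ...}
    s = env.reset()                      # assumes a non-terminating (infinite-horizon) env
    for _ in range(T):
        s, _, _ = env.step(policy(theta, s))
    a = policy(theta, s)
    return s, a
```

Given such a sample and any estimate \(\hat Q\) of \(Q^{\pi_\theta}\), the quantity \(\frac{1}{1-\gamma}\,\nabla_\theta \log\pi_\theta(a\mid s)\,\hat Q(s,a)\) serves as a one-sample estimate of the gradient.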
The policy gradient theorem provides a general mechanism for updating a parameterized policy toward higher long-term return. Its formulation is agnostic to how data are collected, making it applicable in both online and offline reinforcement learning. In practice, however, policy gradient methods are most naturally deployed in the online setting, where trajectories are generated under the current policy \(\pi_\theta\). This ensures that the expectations in the gradient expression are taken under the correct distribution. Nonetheless, policy gradients can still be used in offline RL. In such cases, the main challenge is distribution mismatch: the available data come from a behavior policy \(\mu\), not the target policy \(\pi_\theta\). A common approach is to reweight the offline data using an estimate of the density ratio \(d^{\pi_\theta}(s,a) / d^{\mu}(s,a)\), allowing the gradient to be approximated. Alternatively, one can consider importance sampling at the trajectory level, though this often suffers from high variance.
In the online regime, classical algorithms such as REINFORCE and actor–critic methods directly apply the policy gradient theorem, but they often suffer from high variance and instability when updates are too aggressive. To address this, modern approaches introduce explicit update regularization. Trust Region Policy Optimization (TRPO, Schulman et al. (2015)) constrains each update by limiting the KL divergence between the old and new policies, ensuring monotonic improvement under mild assumptions. Proximal Policy Optimization (PPO, Schulman et al. (2017)) adopts a simpler implementation that clips the density ratio between the new and the old policies, which enforces similar trust-region behavior. These regularized updates have become very popular in training large language models with reinforcement learning from human feedback, where rewards are derived from human preference comparisons.
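For concreteness, here is a minimal sketch of the PPO-style clipped surrogate objective; the batch quantities (log-probabilities under the new and old policies, and advantage estimates standing in for \(Q^{\pi_\theta}\) minus a baseline) are assumed inputs, not part of the lecture.

```python
import numpy as np

def ppo_clipped_objective(logp_new, logp_old, advantages, eps=0.2):
    """PPO clipped surrogate (to be maximized): the per-sample density ratio
    r = pi_new(a|s) / pi_old(a|s) is clipped to [1 - eps, 1 + eps], so that
    updates moving the policy too far away receive no additional credit."""
    ratio = np.exp(logp_new - logp_old)
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    return np.mean(np.minimum(ratio * advantages, clipped * advantages))
```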