In the previous finite-horizon Q-learning setup, actions could depend on the entire history of observed states, actions, and rewards. This flexible framework allowed us to learn an optimal dynamic treatment regime that adapts over time based on accumulated information. In this lecture, we shift to the setting of a Markov Decision Process (MDP, Puterman (1994)) and introduce the Bellman equation, which forms the theoretical foundation of most reinforcement learning algorithms. The MDP framework simplifies the problem by assuming that the system evolves according to the Markov property—meaning that the next state depends only on the current state and action, not on the full past trajectory. This assumption significantly reduces the complexity of the problem. By summarizing all relevant information in the current state, we can model transitions and make decisions more efficiently while still capturing the essential dynamics of the process. Furthermore, the MDP formulation typically focuses on optimizing an infinite-horizon objective, where rewards are accumulated over time with a discount applied to future outcomes. This discounted formulation emphasizes short-term outcomes while still valuing long-term effects, which is often desirable in applications such as finance and healthcare, where immediate gains or patient outcomes are generally more critical than distant future rewards.
We now formalize the setup of a Markov Decision Process (MDP). Recall the basic elements from the previous lecture:
It is often assumed that the expected reward depends on the current state and action through a reward function \(r(S_t, A_t) = \E[R_t \mid S_t, A_t]\). In practice, the observed reward can include random noise around this mean. In some models, the reward may also depend on the next state, for example \(R_t = r(S_{t+1}, S_t, A_t)\), which can be interpreted as the outcome realized after the system transitions to a new state.
In the earlier finite-horizon setting, the policy \(\pi_t(\cdot)\) and the transition dynamic \(\P_t(\cdot)\) could depend on the full history of past states and actions, allowing for very flexible policies. The MDP setting simplifies this by assuming that the system satisfies two key properties:
Markov property: The future state depends only on the current state and the chosen action, not on the entire past history:
\[
\P(S_{t+1} \mid S_t, A_t, S_{t-1}, A_{t-1}, \ldots, S_0, A_0)
= \P(S_{t+1} \mid S_t, A_t).
\]
Stationarity: The transition mechanism does not change over time, that is,
\[
\P_t(\cdot) = \P(\cdot) \quad \text{for all } t.
\]
This implies that the same transition probabilities apply at every step.
Furthermore, we introduce a discount factor \(\gamma \in (0,1)\), which will later be used to define the value function and ensure convergence of long-term rewards. A Markov Decision Process is then defined by the tuple
\[ (\cS, \cA, \P, r, \gamma). \]
In practice, we observe \(n\) (or even just one) i.i.d. trajectories \(\{ S_t, A_t, R_t \}_{t=0}^T\) generated from the MDP.
Figure 1: A graph representation of the Markov Decision Process
The goal of a reinforcement learning problem is either to evaluate a given policy \(\pi\) or to find the best policy \(\pi^\ast\) that maximizes\(^{1}\) the expected long-term return. In the MDP framework, we will use the discounted sum of rewards as an objective to evaluate any policy. This is defined using the value function:
\[ V^\pi(s) = \E_\pi \left[ \sum_{t=0}^\infty \gamma^t R_t \mid S_0 = s \right], \]
where \(V^\pi(s)\) is the value function with \(s\) as the starting point, and the trajectory \(\{ S_t, A_t, R_t \}_{t=0}^\infty\) is generated under policy \(\pi\). The expectation is taken over the randomness in both the state transition kernel \(\P(S_{t+1} \mid S_t, A_t)\) and (if applicable) the stochastic policy \(\pi(A_t \mid S_t)\). A related and equally important concept is the action-value function, or Q-function, defined as
\[ Q^\pi(s, a) = \E_\pi \left[ \sum_{t=0}^\infty \gamma^t R_t \mid S_0 = s, A_0 = a \right]. \]
This quantity measures the expected cumulative reward if the agent starts from state \(s\), takes action \(a\) at the first step, and then continues to follow policy \(\pi\) afterward. The Q-function therefore captures the quality of a particular action in a given state.
Instead of averaging all future rewards (Wei et al. 2020; Liao et al. 2022), the use of a discount factor \(\gamma \in (0,1)\) ensures that future rewards contribute less to the total return and that the infinite series above converges. When the rewards are bounded by \(|R_t| \le R_{\max}\), the discounted sum is guaranteed to be finite:
\[ \Big| \sum_{t=0}^\infty \gamma^t R_t \Big| \le \sum_{t=0}^\infty \gamma^t R_{\max} = \frac{R_{\max}}{1 - \gamma}. \]
The discounted formulation also allows us to develop recursive relationships between \(V^\pi\) and \(Q^\pi\), which will lead to the Bellman equations introduced later. Conceptually, a smaller \(\gamma\) places more emphasis on immediate rewards, while a larger \(\gamma\) values long-term outcomes more heavily. This weighting is often desirable in real-world problems where short-term benefits typically carry greater importance than distant future gains.
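As a quick numerical illustration of the bound \(|\sum_t \gamma^t R_t| \le R_{\max}/(1-\gamma)\) above (a minimal sketch; the values of \(\gamma\), \(R_{\max}\), and the truncation length here are arbitrary choices), a truncated worst-case discounted sum already matches the bound almost exactly:
# Illustrate the bound R_max / (1 - gamma) with arbitrary values
g = 0.7
R_max = 5
sum(g^(0:200) * R_max)   # truncated discounted sum of the constant reward R_max
R_max / (1 - g)          # the bound; the truncation error is negligible since g^200 is tiny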
Due to the Markov property and stationarity, the value and Q-functions can be defined starting from any stage \(t\), and their forms will be identical to those starting from \(t = 0\). That is,
\[ V_t^\pi(s) = \E_\pi \left[ \sum_{j=0}^\infty \gamma^{j} R_{t + j} \mid S_t = s \right], \qquad Q_t^\pi(s, a) = \E_\pi \left[ \sum_{j=0}^\infty \gamma^{j} R_{t + j} \mid S_t = s, A_t = a \right]. \]
The subscript \(t\) is often omitted when the process is stationary, since these functions do not depend explicitly on time.
A central concept in reinforcement learning is the Bellman equation (Sutton and Barto 2018), which expresses a recursive relationship between the value (or Q-) function at any time \(t\) and the value function at the next step. Consider a deterministic policy \(A_t = \pi(S_t)\) (the stochastic case is similar, integrating over the policy distribution). Starting from the definition of the Q-function, we have
\[ \begin{aligned} Q^\pi(s, a) &= \E_\pi \left[ R_t + \gamma \sum_{j=1}^\infty \gamma^{j-1} R_{t+j} \mid S_t = s, A_t = a \right] \\ &= r(s, a) + \gamma \E_{S_{t+1} \sim \P(\cdot \mid s, a)} \left[ \E_\pi \Big[ \sum_{j=0}^\infty \gamma^{j} R_{t+1+j} \mid S_{t+1} \Big] \right] \\ &= r(s, a) + \gamma \E_{S_{t+1} \sim \P(\cdot \mid s, a)} \big[ V^\pi(S_{t+1}) \big]. \end{aligned} \]
This shows that the Q-function equals the immediate reward plus the discounted expected value of the next state. The corresponding recursive equation for the value function is obtained by averaging over actions under the policy \(\pi\):
\[ V^\pi(s) = \E_{A \sim \pi(\cdot \mid s)} \big[ Q^\pi(s, A) \big]. \]
Combining the two relations (substitute the second one into the first) gives the Bellman equation for the Q-function:
\[ Q^\pi(s, a) = r(s, a) + \gamma \E_{S' \sim \P(\cdot \mid s, a),\, A' \sim \pi(\cdot \mid S')} \big[ Q^\pi(S', A') \big]. \tag{1} \]
Here \((S', A')\) denote the (random) next state-action pair generated by the environment and the policy. This recursive structure forms the foundation for most reinforcement learning algorithms, including dynamic programming, temporal-difference learning, and Q-learning. If we instead integrate the entire expectation over the distribution of the action \(A\) at the current state, we obtain the Bellman expectation equation for the value function:
\[ V^\pi(s) = \E_{A \sim \pi(\cdot \mid s)} \big[ r(s, A) + \gamma \E_{ S' \sim \P(\cdot \mid s, A) } [ V^\pi(S') ] \big]. \tag{2} \]
To better illustrate how the Bellman equation operates, consider a finite discrete state space, that is, \(S_t \in \cS\) with \(|\cS| < \infty\). Also, we will consider a finite discrete action space \(A_t \in \cA\) with \(|\cA| < \infty\). In this case, both the transition dynamics and the value function can be expressed in matrix form and we will be able to solve for the value function exactly using linear algebra. To see this, let \(\bP^\pi\) be the \(|\cS| \times |\cS|\) transition matrix that represents the probability of moving from one state to another under policy \(\pi\). Each element \(P^\pi_{s, s'}\) gives the probability of transitioning from state \(s\) to state \(s'\) in one step when the agent follows policy \(\pi\):
\[ \begin{aligned} P^\pi_{s, s'} &= \P(S_{t+1} = s' \mid S_t = s, A_t \sim \pi(\cdot \mid s)) \\ &= \E_{A_t \sim \pi(\cdot \mid s)} \big[ \P(S_{t+1} = s' \mid S_t = s, A_t) \big] \\ &= \sum_{a \in \cA} \P(S_{t+1} = s' \mid S_t = s, A_t = a)\, \pi(a \mid s) \end{aligned} \]
where each row of \(\bP^\pi\) is a probability distribution, satisfying \(\sum_{s'} P^\pi_{s, s'} = 1\) for every \(s \in \cS\). Starting from the Bellman expectation equation defined in (2), we can expand the expectations explicitly: \[ V^\pi(s) = \sum_{a \in \cA} \pi(a \mid s) \Big[ r(s, a) + \gamma \sum_{s' \in \cS} \P(s' \mid s, a) V^\pi(s') \Big]. \] Define the expected reward under the policy at each state by \[ r^\pi(s) = \sum_{a \in \cA} \pi(a \mid s)\, r(s, a), \] and recall that the transition probability under the policy is given by \[ P^\pi_{s, s'} = \sum_{a \in \cA} \pi(a \mid s)\, \P(s' \mid s, a). \] Substituting these definitions into the previous equation yields the simplified scalar Bellman equation \[ V^\pi(s) = r^\pi(s) + \gamma \sum_{s' \in \cS} P^\pi_{s, s'}\, V^\pi(s'). \]
Because there are finitely many states, we can stack all value functions into a vector \(\bv^\pi = (V^\pi(s_1), \ldots, V^\pi(s_{|\cS|}))^\top\), and similarly define the reward vector \(\br^\pi = (r^\pi(s_1), \ldots, r^\pi(s_{|\cS|}))^\top\). Then the Bellman equation for all states can be written compactly as \[ \bv^\pi = \br^\pi + \gamma \bP^\pi \bv^\pi. \tag{3} \] This is a system of \(|\cS|\) linear equations with \(|\cS|\) unknowns. Rearranging terms gives \[ (I - \gamma \bP^\pi)\, \bv^\pi = \br^\pi, \] and when \(I - \gamma \bP^\pi\) is invertible\(^{2}\), the unique solution is \[ \bv^\pi = (I - \gamma \bP^\pi)^{-1} \br^\pi. \]
This matrix formulation provides a compact and useful representation of the Bellman equation in the tabular case. When both \(\P^\pi\) and \(r^\pi\) are known, one can compute the exact value function through direct matrix inversion. This is the task of policy evaluation. In practice, however, \(\P^\pi\) and \(r^\pi\) are usually unknown and must be estimated from data, motivating algorithmic approaches such as Monte Carlo evaluation, temporal-difference learning, and Q-learning.
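Besides direct matrix inversion, equation (3) can also be solved by fixed-point iteration, repeatedly applying the update \(\bv \leftarrow \br^\pi + \gamma \bP^\pi \bv\); the iteration converges because this map is a \(\gamma\)-contraction. Below is a minimal sketch of such iterative policy evaluation (the function name, tolerance, and iteration cap are arbitrary choices; P_pi and r_pi stand for \(\bP^\pi\) and \(\br^\pi\)):
# Iterative policy evaluation: repeatedly apply v <- r_pi + gamma * P_pi %*% v
# (a sketch; P_pi is the |S| x |S| matrix P^pi and r_pi is the vector r^pi)
evaluate_policy_iterative <- function(P_pi, r_pi, gamma, tol = 1e-10, max_iter = 10000){
  v = rep(0, length(r_pi))              # start from the zero vector
  for (i in 1:max_iter){
    v_new = r_pi + gamma * P_pi %*% v   # one Bellman backup
    if (max(abs(v_new - v)) < tol) break
    v = v_new
  }
  as.vector(v)
}
Applied to the worked example below (e.g., evaluate_policy_iterative(P, r_expected, gamma)), this should agree with the matrix-inversion solution up to the chosen tolerance.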
Let’s consider a simple MDP with three states and two treatment actions. When the treatment is 1, the transition probability matrix associated with \(\P(S_{t+1} = s' | S_t = s, A_t = 1 )\) is
\[ P_1 = \begin{pmatrix} 0.9 & 0.075 & 0.025 \\ 0.15 & 0.8 & 0.05 \\ 0.25 & 0.25 & 0.5 \end{pmatrix} \]
The transition probability matrix associated with \(\P(S_{t+1} = s' | S_t = s, A_t = 2 )\) is
\[ P_2 = \begin{pmatrix} 0.8 & 0.1 & 0.1 \\ 0.2 & 0.6 & 0.2 \\ 0.3 & 0.3 & 0.4 \end{pmatrix} \]
We assume a stochastic policy over the two actions, where row \(s\) of the matrix below gives the probabilities of taking actions 1 and 2 in state \(s\):
\[ \pi = \begin{pmatrix} 1/2 & 1/2 \\ 2/3 & 1/3 \\ 1/3 & 2/3 \end{pmatrix} \]
Then the expected transition matrix is
\[ P = \begin{pmatrix} 0.85 & 0.0875 & 0.0625 \\ 0.1667 & 0.7333 & 0.1 \\ 0.2833 & 0.2833 & 0.4333 \end{pmatrix} \]
We further define the reward function \(r(s, a)\) for each state-action pair as
\[ r = \begin{pmatrix} 1 & 3 \\ 2 & 4 \\ 3 & 1 \end{pmatrix} \]
Using Equation (3), we can solve for the value function.
# Define the transition probability matrix and rewards as before
P1 <- matrix(c(0.9, 0.075, 0.025,
0.15, 0.8, 0.05,
0.25, 0.25, 0.5), byrow=TRUE, nrow=3)
P2 <- matrix(c(0.8, 0.1, 0.1,
0.2, 0.6, 0.2,
0.3, 0.3, 0.4), byrow=TRUE, nrow=3)
# Define a stochastic policy
# The first row means that we will take action 1 with probability 1/2, action 2 with probability 1/2
policy <- matrix(c(1/2, 1/2,
2/3, 1/3,
1/3, 2/3), byrow=TRUE, nrow=3)
# Define the expected transition matrix
P = sweep(P1, 1, policy[,1], "*") + sweep(P2, 1, policy[,2], "*")
# define the mean reward function for each state and action pair
r = matrix(c(1, 3,
2, 4,
3, 1), byrow=TRUE, nrow=3)
# define the expected reward for each state
r_expected = rowSums(r * policy)
# define the discount factor
gamma = 0.7
# Solve for the value function using matrix inversion
I = diag(3)
v = solve(I - gamma * P) %*% r_expected
v
## [,1]
## [1,] 6.879767
## [2,] 8.085625
## [3,] 6.652827
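As a sanity check, the computed vector should satisfy the Bellman equation (3). We can also recover the Q-function via \(Q^\pi(s,a) = r(s,a) + \gamma \sum_{s'} \P(s' \mid s, a)\, V^\pi(s')\) and confirm the relation \(V^\pi(s) = \sum_a \pi(a \mid s)\, Q^\pi(s,a)\):
# Bellman residual of the solution: should be numerically zero
max(abs(v - (r_expected + gamma * P %*% v)))
# Recover the Q-function: Q(s, a) = r(s, a) + gamma * sum_{s'} P_a(s, s') V(s')
Q = cbind(r[, 1] + gamma * P1 %*% v,
          r[, 2] + gamma * P2 %*% v)
# V(s) should equal the policy-weighted average of Q(s, .): also numerically zero
max(abs(v - rowSums(Q * policy)))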
It is not obvious whether this result is correct, but we can use simulated data to validate it. For demonstration, let’s simulate the discounted cumulative reward starting from state 2 and average over many replications. This is called a Monte Carlo method.
# write a function to simulate the MDP
simulate_MDP <- function(n, policy, P1, P2, r, gamma, s0){
# prepare the data
S = rep(NA, n)
A = rep(NA, n)
R = rep(NA, n)
# lets start from the first state
S[1] = s0
# sequential simulation
for (t in 1:n){
# take an action based on the policy A_t
A[t] = sample(1:2, 1, prob = policy[S[t],])
# get the reward r(S_t, A_t) with a bit of noise
R[t] = r[S[t], A[t]] + rnorm(1)*0.1
if (t < n){
# get the next state depending on the action
if (A[t] == 1){
S[t+1] = sample(1:3, 1, prob = P1[S[t],])
} else {
S[t+1] = sample(1:3, 1, prob = P2[S[t],])
}
}
}
#record the cumulative discounted reward
return(list(S = S, A = A, R = R, cum_reward = sum(gamma^seq(0, n-1) %*% R)))
}
# Simulate the data
# generate 1000 simulations
nsim = 1000
cum_reward = rep(NA, nsim)
set.seed(1)
for (k in 1:nsim)
{
#record the cumulative discounted reward
cum_reward[k] = simulate_MDP(60, policy, P1, P2, r, gamma, 2)$cum_reward
}
mean(cum_reward)
## [1] 8.043472
This is very close to the true policy value 8.0856248.
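To gauge how much Monte Carlo error to expect in this average, we can also compute its standard error; the truncation at 60 steps adds only a negligible bias since \(\gamma^{60} \approx 5 \times 10^{-10}\).
# Monte Carlo standard error of the estimated policy value at state 2
sd(cum_reward) / sqrt(nsim)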
In practice, we do not directly observe the transition matrix and the expected reward. Instead, we have a sequence of observed data \(\{ S_t, A_t, R_t \}_{t=0}^T\). We can use this data to estimate the transition matrix and the expected reward and plug them into the Bellman equation. Let’s look at an example in which we first observe a sequence of data using the same policy defined previously, and then we want to estimate the policy value if we follow a new policy
\[ \pi_\text{new} = \begin{pmatrix} 1/4 & 3/4 \\ 1/2 & 1/2 \\ 1/3 & 2/3 \end{pmatrix} \]
First, we estimate \(P_1\), \(P_2\) and \(r\) based on the observed data.
# Simulate a sequence of 5000 observations
# It doesn't really matter which initial state we start from
mydata = simulate_MDP(5000, policy, P1, P2, r, gamma, 2)
S = mydata$S
A = mydata$A
R = mydata$R
# Estimate the transition matrix
P1_hat = matrix(0, nrow=3, ncol=3)
P2_hat = matrix(0, nrow=3, ncol=3)
P1_count = 0
P2_count = 0
# estimate the mean reward for each state and action pair
r_hat = matrix(0, nrow=3, ncol=2)
r_count = matrix(0, nrow=3, ncol=2)
for (t in 1:(length(S)-1)){
# for each time point, we count the transition
if (A[t] == 1){
P1_hat[S[t], S[t+1]] = P1_hat[S[t], S[t+1]] + 1
} else {
P2_hat[S[t], S[t+1]] = P2_hat[S[t], S[t+1]] + 1
}
# add and count the reward
r_hat[S[t], A[t]] = r_hat[S[t], A[t]] + R[t]
r_count[S[t], A[t]] = r_count[S[t], A[t]] + 1
}
# calculate the observed transition matrix
P1_hat = P1_hat / rowSums(P1_hat)
P2_hat = P2_hat / rowSums(P2_hat)
# calculate the mean reward for each state and action pair
r_hat = r_hat / r_count
# Print the estimated quantities
P1_hat
## [,1] [,2] [,3]
## [1,] 0.9101754 0.06877193 0.02105263
## [2,] 0.1441006 0.80464217 0.05125725
## [3,] 0.2197309 0.30493274 0.47533632
P2_hat
## [,1] [,2] [,3]
## [1,] 0.7895100 0.09592823 0.1145618
## [2,] 0.2374245 0.56539235 0.1971831
## [3,] 0.3180593 0.30188679 0.3800539
r_hat
## [,1] [,2]
## [1,] 0.9994502 3.000719
## [2,] 2.0002712 3.998590
## [3,] 2.9867864 1.001137
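Before turning to the new policy, note that the same observed trajectory also supports a model-free alternative to this plug-in approach: temporal-difference learning. The TD(0) sketch below evaluates the behavior policy directly from the data; the constant step size is an arbitrary choice, and the resulting estimates should be in the same ballpark as the matrix-inversion values computed earlier, up to noise from the finite trajectory.
# TD(0) evaluation of the behavior policy from the observed trajectory
V_td = rep(0, 3)   # tabular value estimates, initialized at zero
alpha = 0.01       # constant step size (an arbitrary choice)
for (t in 1:(length(S) - 1)){
  # move V(S_t) toward the one-step bootstrapped target R_t + gamma * V(S_{t+1})
  V_td[S[t]] = V_td[S[t]] + alpha * (R[t] + gamma * V_td[S[t+1]] - V_td[S[t]])
}
V_td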
Then we define the new policy and use the Bellman equation to estimate its policy value.
# Define the new policy
policy_new = matrix(c(1/4, 3/4,
1/2, 1/2,
1/3, 2/3), byrow=TRUE, nrow=3)
# calculate the expected transition matrix and reward
r_expected_hat = rowSums(r_hat * policy_new)
P_hat = sweep(P1_hat, 1, policy_new[,1], "*") + sweep(P2_hat, 1, policy_new[,2], "*")
# Solve for the value function using matrix inversion
v_hat = solve(I - gamma * P_hat) %*% r_expected_hat
v_hat
## [,1]
## [1,] 8.309056
## [2,] 9.128629
## [3,] 7.387469
Let’s validate this estimate by simulating new trajectories under the new policy, starting from state 3.
# Simulate the data
# generate 1000 simulations
nsim = 1000
cum_reward = rep(NA, nsim)
set.seed(1)
for (k in 1:nsim)
{
#record the cumulative discounted reward
cum_reward[k] = simulate_MDP(60, policy_new, P1, P2, r, gamma, 3)$cum_reward
}
mean(cum_reward)
## [1] 7.278483
We can see that our estimated policy value at state 3 is 7.3874687, while the simulated truth is 7.2784831.
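Since this is a simulation and the true transition matrices and rewards are known, we can also compute the exact value of the new policy for reference; both the plug-in estimate and the Monte Carlo average should be close to it.
# Exact value of the new policy computed from the true model (only possible in a simulation)
P_new = sweep(P1, 1, policy_new[, 1], "*") + sweep(P2, 1, policy_new[, 2], "*")
r_new = rowSums(r * policy_new)
solve(I - gamma * P_new) %*% r_new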
1. In this setting, the policy we are interested in is also stationary, i.e., \(\pi_t(\cdot) = \pi(\cdot)\) for all \(t\). Under the Markov and stationarity assumptions, the current state \(S_t\) contains all relevant information needed for future decision making. Consequently, the optimal policy (a common goal in reinforcement learning) does not need to depend on the entire history but only on the current state. In fact, for the discounted infinite-horizon problem, it can be shown that there always exists an optimal deterministic stationary policy \(\pi^\ast(S_t)\) that maximizes the long-term expected return. This result is fundamental in dynamic programming and reinforcement learning, and it will become clearer after we introduce the related concepts.
2. To see why \(I - \gamma \bP^\pi\) is invertible, note that every row of \(\bP^\pi\) sums to one (it is row-stochastic), so it is non-expansive in \(\|\cdot\|_\infty\): \(\|\bP^\pi x\|_\infty \le \|x\|_\infty\) for all \(x\). Now suppose \((I-\gamma \bP^\pi)x=0\) for some \(x\); then \(x=\gamma \bP^\pi x\). Taking \(\|\cdot\|_\infty\) on both sides gives \[ \|x\|_\infty \;=\; \gamma \|\bP^\pi x\|_\infty \;\le\; \gamma \|x\|_\infty, \] so \((1-\gamma)\|x\|_\infty \le 0\). Since \(0<\gamma<1\), this forces \(\|x\|_\infty=0\), hence \(x=0\). The null space is therefore trivial and \(I-\gamma \bP^\pi\) is invertible.