In the Q-learning example from the previous lecture, we introduced a framework for how the state, action, and reward evolve over time, and how we can use observed data to estimate the optimal dynamic treatment regime. In this lecture, we introduce the Markov Decision Process (MDP, Puterman (1994)) and the Bellman equation, which form the foundation of many reinforcement learning algorithms. The MDP setting simplifies the previous one by assuming that the state process satisfies the Markov property: the future state depends only on the current state and action, rather than on the entire history. This simplification allows us to estimate quantities more efficiently, because we can borrow information from the past to estimate the future.
Recall the notation from the previous lecture: at each time \(t\), we observe the state \(S_t\), take an action \(A_t\), and receive a reward \(R_t\).
In our Q-learning setting, the decision rule \(\mu_t(\cdot)\) and the transition probability \(\P_t(\cdot)\) could depend on the entire history of states and previous actions. This is a general setting that allows for very flexible decision rules. In the MDP setting, however, we assume that the decision rule depends only on the current state, i.e., \(A_t = \mu_t(S_t)\), and that the transition depends only on the current state and action, i.e., \(S_{t+1} \sim \P(\cdot | S_t, A_t)\). These are enabled by the Markov property,
\[ \P(S_{t+1} | S_t, A_t, S_{t-1}, A_{t-1}, \ldots, S_0, A_0) = \P(S_{t+1} | S_t, A_t), \]
and we will also assume stationarity, i.e., \(\P_t(\cdot) = \P(\cdot)\) for all \(t\). In addition, we introduce a discount factor \(\gamma \in [0, 1)\), which down-weights rewards received further in the future.
Overall, an MDP is specified by the tuple \((\cS, \cA, \P, r, \gamma)\).
The goal of the MDP is to find a policy \(\pi\) that maximizes the expected discounted sum of rewards, which is defined as
\[ \cV^\pi(s) = \E_\pi \left[ \sum_{t=0}^\infty \gamma^t R_t | S_0 = s \right], \]
where the expectation is taken over actions \(A_t \sim \pi(\cdot | S_t)\) and the resulting state transitions. Why are we interested in the discounted reward setting rather than the average reward setting, e.g., Wei et al. (2020) and Liao et al. (2022)? The main motivation is to avoid an infinite return. To see this, when the reward \(R_t\) is bounded with \(|R_t| \leq R_\text{max}\), the discounted sum of rewards is also bounded, since
\[ \Big| \sum_{t=0}^\infty \gamma^t R_t \Big| \leq \sum_{t=0}^\infty \gamma^t R_\text{max} = \frac{R_\text{max}}{1-\gamma}. \]
Using a discount factor also allows us to develop a recursive equation that is easier to solve. Moreover, paying more attention to the immediate reward is often more realistic: in finance and health applications, for instance, the immediate reward matters more than distant future rewards. A smaller \(\gamma\) means heavier discounting, so future rewards become less important.
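As a quick numerical illustration of this bound (a hypothetical check, using an arbitrary \(R_\text{max}\) and the discount factor \(\gamma = 0.7\) that we will also use later), the truncated discounted sum with \(R_t \equiv R_\text{max}\) is already essentially equal to \(R_\text{max}/(1-\gamma)\):
# numerical illustration of the bound on the discounted return
gamma = 0.7
R_max = 1
# truncated discounted sum with the reward fixed at R_max
sum(gamma^(0:1000) * R_max)
# closed-form limit R_max / (1 - gamma)
R_max / (1 - gamma)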
The Q-function we defined previously adapts naturally to the MDP setting:
\[ Q^\pi(s, a) = \E_\pi \left[ \sum_{t=0}^\infty \gamma^t R_t | S_0 = s, A_0 = a \right]. \]
We can also define the value and Q-functions at stage \(t\) as
\[ \cV^\pi_t(s) = \E_\pi \left[ \sum_{j=0}^\infty \gamma^{j} R_{t + j} | S_t = s \right], \]
and
\[ Q^\pi_t(s, a) = \E_\pi \left[ \sum_{j=0}^\infty \gamma^{j} R_{t + j} | S_t = s, A_t = a \right]. \]
Probably the most important tool in reinforcement learning is the Bellman equation (Bellman 1966; Sutton and Barto 2018). It is a recursive equation that relates the value (Q-) function at time \(t\) to the value function at time \(t+1\). We have seen a similar equation in the Q-learning setting; here we restate it in the MDP setting, essentially just removing the long history chain. Let’s consider a deterministic policy with \(A_t = \pi(S_t)\)2. Notice that the Q-function can be written as
\[ \begin{aligned} Q^\pi_t(s, a) &= \E_\pi \left[ R_t + \gamma \sum_{j=1}^\infty \gamma^{j} R_{t+j} | S_t = s, A_t = a \right] \\ &= r(s, a) + \gamma \E_\pi \left[ \sum_{j=1}^\infty \gamma^{j-1} R_{t+j} | S_t = s, A_t = a \right] \\ &= r(s, a) + \gamma \E_{S_{t+1} \sim P(\cdot| s, a)} \left[ \E_\pi \bigg[ \sum_{j=0}^\infty \gamma^{j} R_{t+1+j} | S_{t+1}, S_t = s, A_t = a \bigg] \right] \\ &= r(s, a) + \gamma \E_{S_{t+1} \sim P(\cdot| s, a)} \big[ \cV^\pi(S_{t+1}) \big] \\ \end{aligned} \]
This demonstrates that the Q-function at time \(t\) can be written as the immediate reward plus the discounted expected value at the next state. Under the stationarity assumption, the expression does not depend on the particular choice of \(t\). To illustrate how the Bellman equation can be used, consider a finite discrete state space, i.e., \(S_t \in \cS\) with \(|\cS| < \infty\). Let’s define a transition matrix \(P^\pi\), a \(|\cS| \times |\cS|\) matrix that describes the probability of transitioning from state \(s\) at time \(t\) to state \(s'\) at time \(t + 1\) under policy \(\pi\). The element \(P^\pi_{s, s'}\) is defined as
\[ \begin{aligned} P^\pi_{s, s'} &= \P(S_{t+1} = s' | S_t = s, A_t \sim \pi(s) ) \\ &= \E_{A_t \sim \pi(S_t)} \left[ \P(S_{t+1} = s' | S_t = s, A_t) \right] \\ &= \sum_{a \in \cA} \P(S_{t+1} = s' | S_t = s, A_t = a) \P(A_t = a | S_t = s) \qquad (\text{if } A_t \text{ is discrete}) \end{aligned} \]
Since we have a finite number of states, the value function can only take a finite number of values. Hence, we can write the value function as a vector \(\bv^\pi \in \bR^{|\cS|}\), where the \(i\)-th element is the value function at state \(i\). The Bellman equation can be written collectively as
\[ \bv^\pi = \br^\pi + \gamma P^\pi \bv^\pi, \tag{1} \]
where \(\br^\pi \in \bR^{|\cS|}\) is the expected reward vector, whose element for state \(s\) is \(r^\pi(s) = \E_{A \sim \pi(\cdot | s)} [\, r(s, A) \,]\).
This is a linear system of equations, and we can solve it using matrix inversion, assuming that \(I - \gamma P^\pi\) is invertible3. The solution is
\[ \bv^\pi = (I - \gamma P^\pi)^{-1} \br^\pi. \]
In practice, this formulation is a bit unrealistic because we usually do not know the transition matrix \(P^\pi\) or the expected reward vector \(\br^\pi\). If we do know them, however, we can perform policy evaluation, i.e., compute the value function under a given policy. In general, we will need other algorithms to estimate the value function.
Let’s consider a simple three-state MDP and two treatment actions. When the treatment is 1, the transition probability matrix associated with \(\P(S_{t+1} = s' | S_t = s, A_t = 1 )\) is
\[ P_1 = \begin{pmatrix} 0.9 & 0.075 & 0.025 \\ 0.15 & 0.8 & 0.05 \\ 0.25 & 0.25 & 0.5 \end{pmatrix} \]
The transition probability matrix associated with \(\P(S_{t+1} = s' | S_t = s, A_t = 2 )\) is
\[ P_2 = \begin{pmatrix} 0.8 & 0.1 & 0.1 \\ 0.2 & 0.6 & 0.2 \\ 0.3 & 0.3 & 0.4 \end{pmatrix} \]
We assume a stochastic policy over the two actions, where each row below gives the probabilities of taking actions 1 and 2 at the corresponding state:
\[ \pi = \begin{pmatrix} 1/2 & 1/2 \\ 2/3 & 1/3 \\ 1/3 & 2/3 \end{pmatrix} \]
Then the expected transition matrix is
\[ P = \begin{pmatrix} 0.85 & 0.0875 & 0.0625 \\ 0.1667 & 0.7333 & 0.1 \\ 0.2833 & 0.2833 & 0.4333 \end{pmatrix} \]
We further define the mean reward \(r(s, a)\) for each state-action pair, with rows indexing states and columns indexing actions, as
\[ r = \begin{pmatrix} 1 & 3 \\ 2 & 4 \\ 3 & 1 \end{pmatrix} \]
Using Equation (1), we can solve for the value function. For example, the expected one-step reward at state 1 under \(\pi\) is \(r^\pi(1) = \tfrac{1}{2} \cdot 1 + \tfrac{1}{2} \cdot 3 = 2\), which is the first element of \(\br^\pi\).
# Define the transition probability matrix and rewards as before
P1 <- matrix(c(0.9, 0.075, 0.025,
               0.15, 0.8, 0.05,
               0.25, 0.25, 0.5), byrow=TRUE, nrow=3)
P2 <- matrix(c(0.8, 0.1, 0.1,
               0.2, 0.6, 0.2,
               0.3, 0.3, 0.4), byrow=TRUE, nrow=3)
# Define a stochastic policy
# The first row means that we will take action 1 with probability 1/2, action 2 with probability 1/2
policy <- matrix(c(1/2, 1/2,
                   2/3, 1/3,
                   1/3, 2/3), byrow=TRUE, nrow=3)
# Define the expected transition matrix
P = sweep(P1, 1, policy[,1], "*") + sweep(P2, 1, policy[,2], "*")
# define the mean reward function for each state and action pair
r = matrix(c(1, 3,
             2, 4,
             3, 1), byrow=TRUE, nrow=3)
# define the expected reward for each state
r_expected = rowSums(r * policy)
# define the discount factor
gamma = 0.7
# Solve for the value function using matrix inversion
I = diag(3)
v = solve(I - gamma * P) %*% r_expected
v
## [,1]
## [1,] 6.879767
## [2,] 8.085625
## [3,] 6.652827
It’s not obvious whether this is correct, but we can validate the result using simulated data. For demonstration, let’s simulate the discounted cumulative reward starting from state 2. This is called a Monte Carlo method.
# write a function to simulate the MDP
simulate_MDP <- function(n, policy, P1, P2, r, gamma, s0){
  # prepare the data
  S = rep(NA, n)
  A = rep(NA, n)
  R = rep(NA, n)
  # start from the given initial state
  S[1] = s0
  # sequential simulation
  for (t in 1:n){
    # take an action A_t based on the policy
    A[t] = sample(1:2, 1, prob = policy[S[t],])
    # get the reward r(S_t, A_t) with a bit of noise
    R[t] = r[S[t], A[t]] + rnorm(1)*0.1
    if (t < n){
      # get the next state depending on the action
      if (A[t] == 1){
        S[t+1] = sample(1:3, 1, prob = P1[S[t],])
      } else {
        S[t+1] = sample(1:3, 1, prob = P2[S[t],])
      }
    }
  }
  # return the trajectory and the cumulative discounted reward
  return(list(S = S, A = A, R = R, cum_reward = sum(gamma^seq(0, n-1) %*% R)))
}
# Simulate the data
# generate 1000 simulations
nsim = 1000
cum_reward = rep(NA, nsim)
set.seed(1)
for (k in 1:nsim)
{
  # record the cumulative discounted reward
  cum_reward[k] = simulate_MDP(60, policy, P1, P2, r, gamma, 2)$cum_reward
}
mean(cum_reward)
## [1] 8.043472
This is very close to the true policy value 8.0856248.
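Besides matrix inversion, Equation (1) can also be solved by simple fixed-point iteration, which is convenient when the number of states is large. Below is a minimal sketch that reuses the P, r_expected, and gamma defined above; since \(\gamma < 1\) the update is a contraction, so it should reproduce the vector v computed earlier.
# solve the Bellman equation by iterating v <- r + gamma * P v,
# starting from an all-zero value function
v_iter = rep(0, 3)
for (i in 1:200){
  v_iter = r_expected + gamma * P %*% v_iter
}
v_iter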
In practice, we do not directly observe the transition matrix or the expected reward. Instead, we have a sequence of observed data \(\{ S_t, A_t, R_t \}_{t=0}^T\), which we can use to estimate the transition matrices and mean rewards and then plug the estimates into the Bellman equation. Let’s look at an example in which we first observe a sequence of data generated under the same policy defined previously, and then estimate the policy value if we were to follow a new policy
\[ \pi_\text{new} = \begin{pmatrix} 1/4 & 3/4 \\ 1/2 & 1/2 \\ 1/3 & 2/3 \end{pmatrix} \]
First, we estimate \(P_1\), \(P_2\) and \(r\) based on the observed data.
# Simulate a sequence of 5000 observations
# It doesn't really matter which initial state we start from
mydata = simulate_MDP(5000, policy, P1, P2, r, gamma, 2)
S = mydata$S
A = mydata$A
R = mydata$R
# Estimate the transition matrices
P1_hat = matrix(0, nrow=3, ncol=3)
P2_hat = matrix(0, nrow=3, ncol=3)
# estimate the mean reward for each state and action pair
r_hat = matrix(0, nrow=3, ncol=2)
r_count = matrix(0, nrow=3, ncol=2)
for (t in 1:(length(S)-1)){
  # for each time point, we count the transition
  if (A[t] == 1){
    P1_hat[S[t], S[t+1]] = P1_hat[S[t], S[t+1]] + 1
  } else {
    P2_hat[S[t], S[t+1]] = P2_hat[S[t], S[t+1]] + 1
  }
  # add and count the reward
  r_hat[S[t], A[t]] = r_hat[S[t], A[t]] + R[t]
  r_count[S[t], A[t]] = r_count[S[t], A[t]] + 1
}
# calculate the observed transition matrix
P1_hat = P1_hat / rowSums(P1_hat)
P2_hat = P2_hat / rowSums(P2_hat)
# calculate the mean reward for each state and action pair
r_hat = r_hat / r_count
# Print the estimated quantities
P1_hat
## [,1] [,2] [,3]
## [1,] 0.9101754 0.06877193 0.02105263
## [2,] 0.1441006 0.80464217 0.05125725
## [3,] 0.2197309 0.30493274 0.47533632
P2_hat
## [,1] [,2] [,3]
## [1,] 0.7895100 0.09592823 0.1145618
## [2,] 0.2374245 0.56539235 0.1971831
## [3,] 0.3180593 0.30188679 0.3800539
r_hat
## [,1] [,2]
## [1,] 0.9994502 3.000719
## [2,] 2.0002712 3.998590
## [3,] 2.9867864 1.001137
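Because the data here are simulated, we know the true \(P_1\), \(P_2\), and \(r\), so as a quick sanity check (not something available with real data) we can look at the largest absolute estimation errors:
# compare the plug-in estimates with the true quantities
max(abs(P1_hat - P1))
max(abs(P2_hat - P2))
max(abs(r_hat - r))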
Then we define the new policy and use the Bellman equation to estimate its policy value.
# Define the new policy
policy_new = matrix(c(1/4, 3/4,
                      1/2, 1/2,
                      1/3, 2/3), byrow=TRUE, nrow=3)
# calculate the expected transition matrix and reward
r_expected_hat = rowSums(r_hat * policy_new)
P_hat = sweep(P1_hat, 1, policy_new[,1], "*") + sweep(P2_hat, 1, policy_new[,2], "*")
# Solve for the value function using matrix inversion
v_hat = solve(I - gamma * P_hat) %*% r_expected_hat
v_hat
## [,1]
## [1,] 8.309056
## [2,] 9.128629
## [3,] 7.387469
Let’s validate this using newly simulated data, generated under the true dynamics but following the new policy and starting from state 3.
# Simulate the data
# generate 1000 simulations
nsim = 1000
cum_reward = rep(NA, nsim)
set.seed(1)
for (k in 1:nsim)
{
  # record the cumulative discounted reward
  cum_reward[k] = simulate_MDP(60, policy_new, P1, P2, r, gamma, 3)$cum_reward
}
mean(cum_reward)
## [1] 7.278483
We can see that our estimated policy value at state 3 is 7.3874687, while the simulated truth is 7.2784831.
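Part of this gap comes from estimation error in P1_hat, P2_hat, and r_hat, and part from Monte Carlo error in the simulated truth. A quick way to gauge the Monte Carlo part is the standard error of the simulated returns:
# Monte Carlo standard error of the simulated policy value
sd(cum_reward) / sqrt(nsim)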
Based on the MDP model assumptions, we can write \(r(S_t, A_t) = \int_{s_{t+1}} r_t(S_{t+1}, S_t, A_t) \, \P(S_{t+1} | S_t, A_t) \, d S_{t+1}\).↩︎
The stochastic policy case is similar: we simply integrate out \(A_t\) with respect to its conditional distribution given \(S_t\).↩︎
In fact, this can be shown by proving that every eigenvalue of \(I - \gamma P^\pi\) is nonzero, i.e., the matrix does not map any nonzero vector to zero. Work with the \(\ell_\infty\) norm, under which \(\| P^\pi x \|_\infty \leq \| x \|_\infty\) since each row of \(P^\pi\) sums to one. Then for any vector \(x \neq 0\), \[ \begin{aligned} \| (I - \gamma P^\pi) x \|_\infty &= \| x - \gamma P^\pi x \|_\infty \\ &\geq \| x \|_\infty - \gamma \| P^\pi x \|_\infty \\ &\geq \| x \|_\infty - \gamma \| x \|_\infty \\ &= (1 - \gamma) \| x \|_\infty > 0, \end{aligned} \] so \(I - \gamma P^\pi\) is invertible.↩︎