\(\newcommand{\ci}{\perp\!\!\!\perp}\) \(\newcommand{\cA}{\mathcal{A}}\) \(\newcommand{\cB}{\mathcal{B}}\) \(\newcommand{\cC}{\mathcal{C}}\) \(\newcommand{\cD}{\mathcal{D}}\) \(\newcommand{\cE}{\mathcal{E}}\) \(\newcommand{\cF}{\mathcal{F}}\) \(\newcommand{\cG}{\mathcal{G}}\) \(\newcommand{\cH}{\mathcal{H}}\) \(\newcommand{\cI}{\mathcal{I}}\) \(\newcommand{\cJ}{\mathcal{J}}\) \(\newcommand{\cK}{\mathcal{K}}\) \(\newcommand{\cL}{\mathcal{L}}\) \(\newcommand{\cM}{\mathcal{M}}\) \(\newcommand{\cN}{\mathcal{N}}\) \(\newcommand{\cO}{\mathcal{O}}\) \(\newcommand{\cP}{\mathcal{P}}\) \(\newcommand{\cQ}{\mathcal{Q}}\) \(\newcommand{\cR}{\mathcal{R}}\) \(\newcommand{\cS}{\mathcal{S}}\) \(\newcommand{\cT}{\mathcal{T}}\) \(\newcommand{\cU}{\mathcal{U}}\) \(\newcommand{\cV}{\mathcal{V}}\) \(\newcommand{\cW}{\mathcal{W}}\) \(\newcommand{\cX}{\mathcal{X}}\) \(\newcommand{\cY}{\mathcal{Y}}\) \(\newcommand{\cZ}{\mathcal{Z}}\) \(\newcommand{\bA}{\mathbf{A}}\) \(\newcommand{\bB}{\mathbf{B}}\) \(\newcommand{\bC}{\mathbf{C}}\) \(\newcommand{\bD}{\mathbf{D}}\) \(\newcommand{\bE}{\mathbf{E}}\) \(\newcommand{\bF}{\mathbf{F}}\) \(\newcommand{\bG}{\mathbf{G}}\) \(\newcommand{\bH}{\mathbf{H}}\) \(\newcommand{\bI}{\mathbf{I}}\) \(\newcommand{\bJ}{\mathbf{J}}\) \(\newcommand{\bK}{\mathbf{K}}\) \(\newcommand{\bL}{\mathbf{L}}\) \(\newcommand{\bM}{\mathbf{M}}\) \(\newcommand{\bN}{\mathbf{N}}\) \(\newcommand{\bO}{\mathbf{O}}\) \(\newcommand{\bP}{\mathbf{P}}\) \(\newcommand{\bQ}{\mathbf{Q}}\) \(\newcommand{\bR}{\mathbf{R}}\) \(\newcommand{\bS}{\mathbf{S}}\) \(\newcommand{\bT}{\mathbf{T}}\) \(\newcommand{\bU}{\mathbf{U}}\) \(\newcommand{\bV}{\mathbf{V}}\) \(\newcommand{\bW}{\mathbf{W}}\) \(\newcommand{\bX}{\mathbf{X}}\) \(\newcommand{\bY}{\mathbf{Y}}\) \(\newcommand{\bZ}{\mathbf{Z}}\) \(\newcommand{\ba}{\mathbf{a}}\) \(\newcommand{\bb}{\mathbf{b}}\) \(\newcommand{\bc}{\mathbf{c}}\) \(\newcommand{\bd}{\mathbf{d}}\) \(\newcommand{\be}{\mathbf{e}}\) \(\newcommand{\bg}{\mathbf{g}}\) \(\newcommand{\bh}{\mathbf{h}}\) \(\newcommand{\bi}{\mathbf{i}}\) \(\newcommand{\bj}{\mathbf{j}}\) \(\newcommand{\bk}{\mathbf{k}}\) \(\newcommand{\bl}{\mathbf{l}}\) \(\newcommand{\bm}{\mathbf{m}}\) \(\newcommand{\bn}{\mathbf{n}}\) \(\newcommand{\bo}{\mathbf{o}}\) \(\newcommand{\bp}{\mathbf{p}}\) \(\newcommand{\bq}{\mathbf{q}}\) \(\newcommand{\br}{\mathbf{r}}\) \(\newcommand{\bs}{\mathbf{s}}\) \(\newcommand{\bt}{\mathbf{t}}\) \(\newcommand{\bu}{\mathbf{u}}\) \(\newcommand{\bv}{\mathbf{v}}\) \(\newcommand{\bw}{\mathbf{w}}\) \(\newcommand{\bx}{\mathbf{x}}\) \(\newcommand{\by}{\mathbf{y}}\) \(\newcommand{\bz}{\mathbf{z}}\) \(\newcommand{\balpha}{\boldsymbol{\alpha}}\) \(\newcommand{\bbeta}{\boldsymbol{\beta}}\) \(\newcommand{\btheta}{\boldsymbol{\theta}}\) \(\newcommand{\hpi}{\widehat{\pi}}\) \(\newcommand{\bpi}{\boldsymbol{\pi}}\) \(\newcommand{\hbpi}{\widehat{\boldsymbol{\pi}}}\) \(\newcommand{\bxi}{\boldsymbol{\xi}}\) \(\newcommand{\bmu}{\boldsymbol{\mu}}\) \(\newcommand{\bepsilon}{\boldsymbol{\epsilon}}\) \(\newcommand{\bzero}{\mathbf{0}}\) \(\newcommand{\T}{\text{T}}\) \(\newcommand{\Trace}{\text{Trace}}\) \(\newcommand{\Cov}{\text{Cov}}\) \(\newcommand{\Var}{\text{Var}}\) \(\newcommand{\E}{\mathbb{E}}\) \(\newcommand{\Pr}{\text{Pr}}\) \(\newcommand{\pr}{\text{pr}}\) \(\newcommand{\pdf}{\text{pdf}}\) \(\newcommand{\P}{\text{P}}\) \(\newcommand{\p}{\text{p}}\) \(\newcommand{\One}{\mathbf{1}}\) \(\newcommand{\argmin}{\operatorname*{arg\,min}}\) \(\newcommand{\argmax}{\operatorname*{arg\,max}}\)

Overview

In the previous Q-learning example, we introduced a framework for how the state variable, action, and reward evolve over time, and how we can use data to estimate the optimal dynamic treatment regime. In this lecture, we introduce the concept of the Markov Decision Process (MDP, Puterman (1994)) and the Bellman equation, which form the foundation of many reinforcement learning algorithms. The MDP setting simplifies the previous one by assuming that the state variable satisfies a Markov property: the future state depends only on the current state and action, rather than on the entire history. This simplification allows us to estimate quantities more efficiently, because information can be pooled across time points instead of conditioning on each unique history.

Markov Decision Process

In the previous lecture, we introduced the following notation:

  • \(S_t\) is the state variable at time \(t\), which can be a vector of covariates.
  • \(A_t\) is the action taken at time \(t\), which can be a treatment assignment.
  • \(R_t\) is the reward at time \(t\). We often assume that the reward is a function of the state and action, i.e., \(R_t = r(S_t, A_t)\). It is also common to assume that \(R_t\) incorporates some random noise on top of the deterministic function \(r(\cdot)\). Furthermore, we can also allow the reward to depend on the future state, i.e., \(R_t = r(S_{t+1}, S_t, A_t)\)1.

In the Q-learning setting, the decision rule \(\mu_t(\cdot)\) and the transition probability \(\P_t(\cdot)\) could depend on the entire history of states and actions, which allows for a very flexible decision rule. In the MDP setting, however, we assume that the decision rule depends only on the current state, i.e., \(A_t = \mu_t(S_t)\), and that the transition depends only on the current state and action, i.e., \(S_{t+1} \sim \P(\cdot | S_t, A_t)\). These simplifications are enabled by the Markov property,

\[ \P(S_{t+1} | S_t, A_t, S_{t-1}, A_{t-1}, \ldots, S_0, A_0) = \P(S_{t+1} | S_t, A_t). \]

We will also assume stationary transition probabilities, i.e., \(\P_t(\cdot) = \P(\cdot)\) for all \(t\). In addition, we introduce a discount factor

  • \(\gamma \in [0, 1)\) to discount the future reward

Overall, an MDP is specified by the tuple \((\cS, \cA, \P, r, \gamma)\).

Figure 1: A graph representation of the Markov Decision Process
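
As a concrete sketch of this tuple (the names and sizes below are purely illustrative, not a standard API), a small finite MDP could be stored in R as a list, with the transition probabilities kept in a three-dimensional array indexed by \((s, a, s')\):

  # a minimal sketch of a finite MDP stored as an R list (illustrative only)
  mdp <- list(
    states  = 1:3,                           # state space S
    actions = 1:2,                           # action space A
    P       = array(1/3, dim = c(3, 2, 3)),  # P[s, a, s'] = P(S_{t+1} = s' | S_t = s, A_t = a)
    r       = matrix(0, nrow = 3, ncol = 2), # r[s, a] = expected immediate reward
    gamma   = 0.7                            # discount factor
  )
  mdp$P[1, 2, ]   # distribution of the next state when starting in state 1 and taking action 2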

Discounted Sum of Rewards

The goal of the MDP is to find a policy \(\pi\) that maximizes the expected discounted sum of rewards, which is defined as

\[ \cV^\pi(s) = \E_\pi \left[ \sum_{t=0}^\infty \gamma^t R_t | S_0 = s \right], \]

where the expectation is taken with actions drawn from the policy \(\pi(A_t | S_t)\). Why are we interested in the discounted reward setting, rather than the average reward setting (e.g., Wei et al. (2020) and Liao et al. (2022))? The main motivation is to avoid an infinite return. To see this, when the reward \(R_t\) is bounded with \(|R_t| \leq R_\text{max}\), the discounted sum of rewards is also bounded:

\[ \Big| \sum_{t=0}^\infty \gamma^t R_t \Big| \leq \sum_{t=0}^\infty \gamma^t R_\text{max} = \frac{R_\text{max}}{1-\gamma}. \]

In addition, the discount factor allows us to develop a recursive equation that is easier to solve. Paying more attention to the immediate reward is also practical in many applications; in finance and health, for example, the immediate reward often matters more than rewards far in the future. A smaller \(\gamma\) means a heavier discount, and hence future rewards become less important.
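
As a quick numerical illustration of this bound (the values below are illustrative, not part of the later example), we can compare a long truncated discounted sum of the maximal reward against \(R_\text{max}/(1-\gamma)\):

  # geometric-series bound: truncated sum vs. closed form (illustrative values)
  gamma_demo <- 0.7
  R_max <- 1
  sum(gamma_demo^(0:1000) * R_max)   # truncated discounted sum of the maximal reward
  R_max / (1 - gamma_demo)           # closed-form bound; both are about 3.3333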

Bellman Equation

The Q-function we defined previously adapts naturally to the MDP setting:

\[ Q^\pi(s, a) = \E_\pi \left[ \sum_{t=0}^\infty \gamma^t R_t | S_0 = s, A_0 = a \right]. \]

We can also define the value and Q-functions at stage \(t\) as

\[ \cV^\pi_t(s) = \E_\pi \left[ \sum_{j=0}^\infty \gamma^{j} R_{t + j} | S_t = s \right], \]

and

\[ Q^\pi_t(s, a) = \E_\pi \left[ \sum_{j=0}^\infty \gamma^{j} R_{t + j} | S_t = s, A_t = a \right]. \]

Probably the most important tool in reinforcement learning is the Bellman equation (Bellman 1966; Sutton and Barto 2018). The Bellman equation is a recursive equation that relates the value (Q-) function at time \(t\) to the value function at time \(t+1\). We have seen a similar equation in the Q-learning setting, but here we re-state it in the MDP setting, essentially removing the long history chain. Let’s consider a deterministic policy with \(A_t = \pi(S_t)\)2. Notice that the Q-function can be written as

\[ \begin{aligned} Q^\pi_t(s, a) &= \E_\pi \left[ R_t + \gamma \sum_{j=1}^\infty \gamma^{j-1} R_{t+j} | S_t = s, A_t = a \right] \\ &= r(s, a) + \gamma \E_\pi \left[ \sum_{j=1}^\infty \gamma^{j-1} R_{t+j} | S_t = s, A_t = a \right] \\ &= r(s, a) + \gamma \E_{S_{t+1} \sim \P(\cdot| s, a)} \left[ \E_\pi \bigg[ \sum_{j=0}^\infty \gamma^{j} R_{t+1+j} | S_{t+1}, S_t = s, A_t = a \bigg] \right] \\ &= r(s, a) + \gamma \E_{S_{t+1} \sim \P(\cdot| s, a)} \big[ \cV^\pi(S_{t+1}) \big] \\ \end{aligned} \]

The last step uses the Markov property, and the derivation shows that the Q-function at time \(t\) equals the immediate reward plus the discounted expected value of the next state. Under the stationarity assumption, the expression does not depend on the choice of \(t\), so we can drop the time index and simply write \(Q^\pi(s, a)\) and \(\cV^\pi(s)\). To better illustrate how the Bellman equation can be used, consider a finite discrete state space, i.e., \(S_t \in \cS\) with \(|\cS| < \infty\). Let’s define a transition matrix \(P^\pi\). This is a \(|\cS| \times |\cS|\) dimensional matrix that describes the probability of transitioning from state \(s\) at time \(t\) to state \(s'\) at time \(t + 1\) under policy \(\pi\). The element \(P^\pi_{s, s'}\) is defined as

\[ \begin{aligned} P^\pi_{s, s'} &= \P(S_{t+1} = s' | S_t = s, A_t \sim \pi(s) ) \\ &= \E_{A_t \sim \pi(S_t)} \left[ \P(S_{t+1} = s' | S_t = s, A_t) \right] \\ &= \sum_{a \in \cA} \P(S_{t+1} = s' | S_t = s, A_t = a) \P(A_t = a | S_t = s) \qquad (\text{if } A_t \text{ is discrete}) \end{aligned} \]

Since the state space is finite, the value function is fully determined by its values on the \(|\cS|\) states. Hence, we can stack these values into a vector \(\bv^\pi \in \bR^{|\cS|}\), where the \(i\)-th element is the value function at state \(i\). The Bellman equation can be written collectively as

\[ \bv^\pi = \br^\pi + \gamma P^\pi \bv^\pi, \tag{1} \]

where

  • \(\br^\pi\) is a reward vector, with the \(i\)-th element being the expected reward at state \(i\), i.e., \(\E_{A_t \sim \pi(\cdot | s_i)}[r(s_i, A_t)]\).
  • Each row of \(P^\pi\) is a probability distribution vector, i.e., \(\sum_{s'} P^\pi_{s, s'} = 1\) for any \(s\). Hence, the inner product between the \(i\)-th row of \(P^\pi\) and \(\bv^\pi\) is the expected value of the next state, i.e., \(\E_{S_{t+1}}[ \bv^\pi(S_{t+1}) | S_t = s_i, A_t \sim \pi(s_i) ]\).

This is a linear system, and we can solve it by matrix inversion, assuming that \(I - \gamma P^\pi\) is invertible3. The solution is

\[ \bv^\pi = (I - \gamma P^\pi)^{-1} \br^\pi. \]
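
Alternatively, instead of inverting the matrix, we can iterate the Bellman equation as a fixed-point update; since \(\gamma < 1\), the update is a contraction and converges to the same solution. Below is a minimal R sketch of this iterative policy evaluation, assuming `P` and `r_pi` hold \(P^\pi\) and \(\br^\pi\) (this helper is only for illustration and is not used later in this note):

  # iterative policy evaluation: repeatedly apply v <- r_pi + gamma * P %*% v
  # P: |S| x |S| transition matrix under pi;  r_pi: expected reward vector under pi
  evaluate_policy_iterative <- function(P, r_pi, gamma, tol = 1e-10, max_iter = 10000) {
    v <- rep(0, length(r_pi))                # start from the zero vector
    for (i in 1:max_iter) {
      v_new <- r_pi + gamma * P %*% v        # one Bellman backup
      if (max(abs(v_new - v)) < tol) break   # stop when the update is negligible
      v <- v_new
    }
    drop(v_new)                              # return as a plain vector
  }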

In practice, this formulation is a bit unrealistic because we often do not know the transition matrix \(P^\pi\) or the expected reward vector \(\br^\pi\). If we do know them, however, we can perform policy evaluation, i.e., compute the value function under a given policy. In general, we will need other algorithms to estimate the value function from observed data.

Example: Policy Evaluation with True Model Generator

Let’s consider a simple MDP with three states and two treatment actions. When the treatment is 1, the transition probability matrix associated with \(\P(S_{t+1} = s' | S_t = s, A_t = 1 )\) is

\[ P_1 = \begin{pmatrix} 0.9 & 0.075 & 0.025 \\ 0.15 & 0.8 & 0.05 \\ 0.25 & 0.25 & 0.5 \end{pmatrix} \]

The transition probability matrix associated with \(\P(S_{t+1} = s' | S_t = s, A_t = 2 )\) is

\[ P_2 = \begin{pmatrix} 0.8 & 0.1 & 0.1 \\ 0.2 & 0.6 & 0.2 \\ 0.3 & 0.3 & 0.4 \end{pmatrix} \]

We assume a stochastic policy over the two actions, where row \(s\) of the following matrix gives the probabilities of taking actions 1 and 2 at state \(s\):

\[ \pi = \begin{pmatrix} 1/2 & 1/2 \\ 2/3 & 1/3 \\ 1/3 & 2/3 \end{pmatrix} \]

Then the expected transition matrix is

\[ P = \begin{pmatrix} 0.85 & 0.0875 & 0.0625 \\ 0.1667 & 0.7333 & 0.1 \\ 0.2833 & 0.2833 & 0.4333 \end{pmatrix} \]

We further define the reward function \(r(s, a)\) for each state-action pair as

\[ r = \begin{pmatrix} 1 & 3 \\ 2 & 4 \\ 3 & 1 \end{pmatrix} \]

Using Equation (1), we can solve for the value function.

  # Define the transition probability matrix and rewards as before
  P1 <- matrix(c(0.9, 0.075, 0.025,
                 0.15, 0.8, 0.05,
                 0.25, 0.25, 0.5), byrow=TRUE, nrow=3)

  P2 <- matrix(c(0.8, 0.1, 0.1,
                 0.2, 0.6, 0.2,
                 0.3, 0.3, 0.4), byrow=TRUE, nrow=3)

  # Define a stochastic policy
  # The first row means that we will take action 1 with probability 1/2, action 2 with probability 1/2
  policy <- matrix(c(1/2, 1/2,
                     2/3, 1/3,
                     1/3, 2/3), byrow=TRUE, nrow=3)

  # Define the expected transition matrix
  P = sweep(P1, 1, policy[,1], "*") + sweep(P2, 1, policy[,2], "*")

  # define the mean reward function for each state and action pair
  r = matrix(c(1, 3,
               2, 4,
               3, 1), byrow=TRUE, nrow=3)

  # define the expected reward for each state
  r_expected = rowSums(r * policy)
  
  # define the discount factor
  gamma = 0.7
  
  # Solve for the value function using matrix inversion
  I = diag(3)
  v = solve(I - gamma * P) %*% r_expected
  v
##          [,1]
## [1,] 6.879767
## [2,] 8.085625
## [3,] 6.652827
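
As a quick sanity check (an added line, not part of the original output), the solution should satisfy the Bellman equation \(\bv^\pi = \br^\pi + \gamma P^\pi \bv^\pi\) up to numerical precision:

  # the Bellman residual should be numerically zero
  max(abs(v - (r_expected + gamma * P %*% v)))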

Monte Carlo Approximation of Policy Value

It’s not yet clear whether this answer is correct, but we can validate it with simulated data. For demonstration, let’s simulate the discounted cumulative reward starting at state 2 many times and average the results. This is called a Monte Carlo method.

  # write a function to simulate the MDP
  simulate_MDP <- function(n, policy, P1, P2, r, gamma, s0){
    # prepare the data
    S = rep(NA, n)
    A = rep(NA, n)
    R = rep(NA, n)
    
    # start from the user-specified initial state s0
    S[1] = s0
    
    # sequential simulation
    for (t in 1:n){
      
      # take an action based on the policy A_t
      A[t] = sample(1:2, 1, prob = policy[S[t],])
      
      # get the reward r(S_t, A_t) with a bit of noise
      R[t] = r[S[t], A[t]] + rnorm(1)*0.1
      
      if (t < n){
        # get the next state depending on the action
        if (A[t] == 1){
          S[t+1] = sample(1:3, 1, prob = P1[S[t],])
        } else {
          S[t+1] = sample(1:3, 1, prob = P2[S[t],])
        }
      }
    }
  
    #record the cumulative discounted reward
    return(list(S = S, A = A, R = R, cum_reward = sum(gamma^seq(0, n-1) %*% R)))
  }

  # Simulate the data
  # generate 1000 simulations
  nsim = 1000
  cum_reward = rep(NA, nsim)
  set.seed(1)
    
  for (k in 1:nsim)
  {
    #record the cumulative discounted reward
    cum_reward[k] = simulate_MDP(60, policy, P1, P2, r, gamma, 2)$cum_reward
  }
  
  mean(cum_reward)
## [1] 8.043472

This is very close to the true policy value at state 2 that we computed earlier, 8.0856248.
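
The remaining gap is mostly Monte Carlo error (truncating the infinite sum at 60 steps contributes only on the order of \(\gamma^{60} R_\text{max}/(1-\gamma)\), which is negligible here). As a small added check, a rough standard error of the Monte Carlo estimate can be computed as follows:

  # Monte Carlo standard error of the estimated policy value (added check, output not shown)
  sd(cum_reward) / sqrt(nsim)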

Example: Policy Evaluation with Observed Data

In practice, we do not directly observe the transition matrix and the expected reward. Instead, we have a sequence of observed data \(\{ S_t, A_t, R_t \}_{t=0}^T\). We can use this data to estimate the transition matrix and the expected reward, and then plug the estimates into the Bellman equation. Let’s look at an example: we first observe a sequence of data generated under the same policy defined previously, and then we estimate the policy value if we were to follow a new policy

\[ \pi_\text{new} = \begin{pmatrix} 1/4 & 3/4 \\ 1/2 & 1/2 \\ 1/3 & 2/3 \end{pmatrix} \]

First, we estimate \(P_1\), \(P_2\) and \(r\) based on the observed data.

  # Simulate a sequence of 5000 observations 
  # It doesn't really matter which initial state we start from
  mydata = simulate_MDP(5000, policy, P1, P2, r, gamma, 2)
  S = mydata$S
  A = mydata$A
  R = mydata$R

  # Estimate the transition matrix
  P1_hat = matrix(0, nrow=3, ncol=3)
  P2_hat = matrix(0, nrow=3, ncol=3)
  
  # estimate the mean reward for each state and action pair
  r_hat = matrix(0, nrow=3, ncol=2)
  r_count = matrix(0, nrow=3, ncol=2)
  
  for (t in 1:(length(S)-1)){
    
    # for each time point, we count the transition
    if (A[t] == 1){
      
      P1_hat[S[t], S[t+1]] = P1_hat[S[t], S[t+1]] + 1

    } else {
      
      P2_hat[S[t], S[t+1]] = P2_hat[S[t], S[t+1]] + 1
      
    }
    
    # add and count the reward
    r_hat[S[t], A[t]] = r_hat[S[t], A[t]] + R[t]
    r_count[S[t], A[t]] = r_count[S[t], A[t]] + 1
    
  }
  
  # calculate the observed transition matrix
  P1_hat = P1_hat / rowSums(P1_hat)
  P2_hat = P2_hat / rowSums(P2_hat)
  
  # calculate the mean reward for each state and action pair
  r_hat = r_hat / r_count
  
  # Print the estimated quantities
  P1_hat
##           [,1]       [,2]       [,3]
## [1,] 0.9101754 0.06877193 0.02105263
## [2,] 0.1441006 0.80464217 0.05125725
## [3,] 0.2197309 0.30493274 0.47533632
  P2_hat
##           [,1]       [,2]      [,3]
## [1,] 0.7895100 0.09592823 0.1145618
## [2,] 0.2374245 0.56539235 0.1971831
## [3,] 0.3180593 0.30188679 0.3800539
  r_hat
##           [,1]     [,2]
## [1,] 0.9994502 3.000719
## [2,] 2.0002712 3.998590
## [3,] 2.9867864 1.001137
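
Since this is a simulation, we can also compare the estimated quantities against the truth; with 5000 observations the entrywise errors should be small (an added check, output not shown):

  # largest entrywise estimation errors (added check)
  max(abs(P1_hat - P1))
  max(abs(P2_hat - P2))
  max(abs(r_hat - r))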

Then we define the new policy and use the Bellman equation to estimate its policy value

  # Define the new policy
  policy_new = matrix(c(1/4, 3/4,
                        1/2, 1/2,
                        1/3, 2/3), byrow=TRUE, nrow=3)

  # calculate the expected transition matrix and reward
  r_expected_hat = rowSums(r_hat * policy_new)
  P_hat = sweep(P1_hat, 1, policy_new[,1], "*") + sweep(P2_hat, 1, policy_new[,2], "*")
  
  # Solve for the value function using matrix inversion
  v_hat = solve(I - gamma * P_hat) %*% r_expected_hat
  v_hat
##          [,1]
## [1,] 8.309056
## [2,] 9.128629
## [3,] 7.387469
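
For comparison, since we know the true transition matrices and rewards in this simulation, we could also plug the truth into Equation (1) under the new policy; the estimate v_hat above should be close to this model-based value (added check, output not shown):

  # value of the new policy computed from the true model, for comparison
  P_new_true = sweep(P1, 1, policy_new[,1], "*") + sweep(P2, 1, policy_new[,2], "*")
  r_new_true = rowSums(r * policy_new)
  solve(I - gamma * P_new_true) %*% r_new_true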

Let’s validate this using new simulated data

  # Simulate the data
  # generate 1000 simulations
  nsim = 1000
  cum_reward = rep(NA, nsim)
  set.seed(1)
    
  for (k in 1:nsim)
  {
    #record the cumulative discounted reward
    cum_reward[k] = simulate_MDP(60, policy_new, P1, P2, r, gamma, 3)$cum_reward
  }
  
  mean(cum_reward)
## [1] 7.278483

We can see that our estimated policy value at state 3 is 7.3874687, while the Monte Carlo approximation based on the true model is 7.2784831, so the two are reasonably close.

Reference

Bellman, Richard. 1966. “Dynamic Programming.” Science 153 (3731): 34–37.
Liao, Peng, Zhengling Qi, Runzhe Wan, Predrag Klasnja, and Susan A Murphy. 2022. “Batch Policy Learning in Average Reward Markov Decision Processes.” Annals of Statistics 50 (6): 3364.
Puterman, Martin L. 1994. “Markov Decision Processes.” Wiley Series in Probability and Statistics.
Sutton, Richard S, and Andrew G Barto. 2018. Reinforcement Learning: An Introduction. MIT press.
Wei, Chen-Yu, Mehdi Jafarnia Jahromi, Haipeng Luo, Hiteshi Sharma, and Rahul Jain. 2020. “Model-Free Reinforcement Learning in Infinite-Horizon Average-Reward Markov Decision Processes.” In International Conference on Machine Learning, 10170–80. PMLR.

  1. Based on the MDP model assumptions, we can write \(r(S_t, A_t) = \int r(s_{t+1}, S_t, A_t) \, \P(s_{t+1} | S_t, A_t) \, d s_{t+1}\).↩︎

  2. The stochastic policy case is similar; we simply integrate out \(A_t\) with respect to its conditional distribution given the state.↩︎

  3. In fact, this can be shown by proving that every eigenvalue of \(I - \gamma P^\pi\) is nonzero, i.e., the matrix does not send any nonzero vector to zero (if \(x \neq 0\), then \((I - \gamma P^\pi) x \neq 0\)). Working with the \(\ell_\infty\) norm, for which \(\| P^\pi x \| \leq \| x \|\) because each row of \(P^\pi\) is a probability distribution, we have for any vector \(x \neq 0\), \[ \begin{aligned} \| (I - \gamma P^\pi) x \| &= \| x - \gamma P^\pi x \| \\ &\geq \| x \| - \gamma \| P^\pi x \| \\ &\geq \| x \| - \gamma \| x \| \\ &= (1 - \gamma) \| x \| > 0. \end{aligned} \]↩︎