\(\newcommand{\ci}{\perp\!\!\!\perp}\) \(\newcommand{\cA}{\mathcal{A}}\) \(\newcommand{\cB}{\mathcal{B}}\) \(\newcommand{\cC}{\mathcal{C}}\) \(\newcommand{\cD}{\mathcal{D}}\) \(\newcommand{\cE}{\mathcal{E}}\) \(\newcommand{\cF}{\mathcal{F}}\) \(\newcommand{\cG}{\mathcal{G}}\) \(\newcommand{\cH}{\mathcal{H}}\) \(\newcommand{\cI}{\mathcal{I}}\) \(\newcommand{\cJ}{\mathcal{J}}\) \(\newcommand{\cK}{\mathcal{K}}\) \(\newcommand{\cL}{\mathcal{L}}\) \(\newcommand{\cM}{\mathcal{M}}\) \(\newcommand{\cN}{\mathcal{N}}\) \(\newcommand{\cO}{\mathcal{O}}\) \(\newcommand{\cP}{\mathcal{P}}\) \(\newcommand{\cQ}{\mathcal{Q}}\) \(\newcommand{\cR}{\mathcal{R}}\) \(\newcommand{\cS}{\mathcal{S}}\) \(\newcommand{\cT}{\mathcal{T}}\) \(\newcommand{\cU}{\mathcal{U}}\) \(\newcommand{\cV}{\mathcal{V}}\) \(\newcommand{\cW}{\mathcal{W}}\) \(\newcommand{\cX}{\mathcal{X}}\) \(\newcommand{\cY}{\mathcal{Y}}\) \(\newcommand{\cZ}{\mathcal{Z}}\) \(\newcommand{\bA}{\mathbf{A}}\) \(\newcommand{\bB}{\mathbf{B}}\) \(\newcommand{\bC}{\mathbf{C}}\) \(\newcommand{\bD}{\mathbf{D}}\) \(\newcommand{\bE}{\mathbf{E}}\) \(\newcommand{\bF}{\mathbf{F}}\) \(\newcommand{\bG}{\mathbf{G}}\) \(\newcommand{\bH}{\mathbf{H}}\) \(\newcommand{\bI}{\mathbf{I}}\) \(\newcommand{\bJ}{\mathbf{J}}\) \(\newcommand{\bK}{\mathbf{K}}\) \(\newcommand{\bL}{\mathbf{L}}\) \(\newcommand{\bM}{\mathbf{M}}\) \(\newcommand{\bN}{\mathbf{N}}\) \(\newcommand{\bO}{\mathbf{O}}\) \(\newcommand{\bP}{\mathbf{P}}\) \(\newcommand{\bQ}{\mathbf{Q}}\) \(\newcommand{\bR}{\mathbf{R}}\) \(\newcommand{\bS}{\mathbf{S}}\) \(\newcommand{\bT}{\mathbf{T}}\) \(\newcommand{\bU}{\mathbf{U}}\) \(\newcommand{\bV}{\mathbf{V}}\) \(\newcommand{\bW}{\mathbf{W}}\) \(\newcommand{\bX}{\mathbf{X}}\) \(\newcommand{\bY}{\mathbf{Y}}\) \(\newcommand{\bZ}{\mathbf{Z}}\) \(\newcommand{\ba}{\mathbf{a}}\) \(\newcommand{\bb}{\mathbf{b}}\) \(\newcommand{\bc}{\mathbf{c}}\) \(\newcommand{\bd}{\mathbf{d}}\) \(\newcommand{\be}{\mathbf{e}}\) \(\newcommand{\bg}{\mathbf{g}}\) \(\newcommand{\bh}{\mathbf{h}}\) \(\newcommand{\bi}{\mathbf{i}}\) \(\newcommand{\bj}{\mathbf{j}}\) \(\newcommand{\bk}{\mathbf{k}}\) \(\newcommand{\bl}{\mathbf{l}}\) \(\newcommand{\bm}{\mathbf{m}}\) \(\newcommand{\bn}{\mathbf{n}}\) \(\newcommand{\bo}{\mathbf{o}}\) \(\newcommand{\bp}{\mathbf{p}}\) \(\newcommand{\bq}{\mathbf{q}}\) \(\newcommand{\br}{\mathbf{r}}\) \(\newcommand{\bs}{\mathbf{s}}\) \(\newcommand{\bt}{\mathbf{t}}\) \(\newcommand{\bu}{\mathbf{u}}\) \(\newcommand{\bv}{\mathbf{v}}\) \(\newcommand{\bw}{\mathbf{w}}\) \(\newcommand{\bx}{\mathbf{x}}\) \(\newcommand{\by}{\mathbf{y}}\) \(\newcommand{\bz}{\mathbf{z}}\) \(\newcommand{\RR}{\mathbb{R}}\) \(\newcommand{\NN}{\mathbb{N}}\) \(\newcommand{\balpha}{\boldsymbol{\alpha}}\) \(\newcommand{\bbeta}{\boldsymbol{\beta}}\) \(\newcommand{\btheta}{\boldsymbol{\theta}}\) \(\newcommand{\hpi}{\widehat{\pi}}\) \(\newcommand{\bpi}{\boldsymbol{\pi}}\) \(\newcommand{\hbpi}{\widehat{\boldsymbol{\pi}}}\) \(\newcommand{\bxi}{\boldsymbol{\xi}}\) \(\newcommand{\bmu}{\boldsymbol{\mu}}\) \(\newcommand{\bepsilon}{\boldsymbol{\epsilon}}\) \(\newcommand{\bzero}{\mathbf{0}}\) \(\newcommand{\T}{\text{T}}\) \(\newcommand{\Trace}{\text{Trace}}\) \(\newcommand{\Cov}{\text{Cov}}\) \(\newcommand{\Var}{\text{Var}}\) \(\newcommand{\E}{\mathbb{E}}\) \(\newcommand{\Pr}{\text{Pr}}\) \(\newcommand{\pr}{\text{pr}}\) \(\newcommand{\pdf}{\text{pdf}}\) \(\newcommand{\P}{\text{P}}\) \(\newcommand{\p}{\text{p}}\) \(\newcommand{\One}{\mathbf{1}}\) \(\newcommand{\argmin}{\operatorname*{arg\,min}}\) 
\(\newcommand{\argmax}{\operatorname*{arg\,max}}\) \(\newcommand{\dtheta}{\frac{\partial}{\partial\theta} }\) \(\newcommand{\ptheta}{\nabla_\theta}\) \(\newcommand{\alert}[1]{\color{darkorange}{#1}}\) \(\newcommand{\alertr}[1]{\color{red}{#1}}\) \(\newcommand{\alertb}[1]{\color{blue}{#1}}\)

1 Overview

In the previous finite-horizon Q-learning setup, actions could depend on the entire history of observed states, actions, and rewards. This flexible framework allowed us to learn an optimal dynamic treatment regime that adapts over time based on accumulated information. In this lecture, we shift to the setting of a Markov Decision Process (MDP, Puterman (1994)) and introduce the Bellman equation, which forms the theoretical foundation of most reinforcement learning algorithms. The MDP framework simplifies the problem by assuming that the system evolves according to the Markov property—meaning that the next state depends only on the current state and action, not on the full past trajectory. This assumption significantly reduces the complexity of the problem. By summarizing all relevant information in the current state, we can model transitions and make decisions more efficiently while still capturing the essential dynamics of the process. Furthermore, the MDP formulation typically focuses on optimizing an infinite-horizon objective, where rewards are accumulated over time with a discount applied to future outcomes. This discounted formulation emphasizes short-term outcomes while still valuing long-term effects, which is often desirable in applications such as finance and healthcare, where immediate gains or patient outcomes are generally more critical than distant future rewards.

2 Markov Decision Process

We now formalize the setup of a Markov Decision Process (MDP). Recall the basic elements from the previous lecture:

  • \(S_t \in \cS\) is the state variable at time \(t\), representing the current information about the system or patient. It can be a scalar or a vector of covariates.
  • \(A_t \in \cA\) is the action or treatment taken at time \(t\).
  • \(R_t\) is the immediate reward received after taking action \(A_t\) in state \(S_t\).

It is often assumed that the expected reward depends on the current state and action through a reward function \(r(S_t, A_t) = \E[R_t \mid S_t, A_t]\). In practice, the observed reward can include random noise around this mean. In some models, the reward may also depend on the next state, for example \(R_t = r(S_{t+1}, S_t, A_t)\), which can be interpreted as the outcome realized after the system transitions to a new state.

In the earlier finite-horizon setting, the policy \(\pi_t(\cdot)\) and the transition dynamic \(\P_t(\cdot)\) could depend on the full history of past states and actions, allowing for very flexible policies. The MDP setting simplifies this by assuming that the system satisfies two key properties:

  • Markov property: The future state depends only on the current state and the chosen action, not on the entire past history:
    \[ \P(S_{t+1} \mid S_t, A_t, S_{t-1}, A_{t-1}, \ldots, S_0, A_0) = \P(S_{t+1} \mid S_t, A_t). \]

  • Stationarity: The transition mechanism does not change over time, that is,
    \[ \P_t(\cdot) = \P(\cdot) \quad \text{for all } t. \] This implies that the same transition probabilities apply at every step.

Furthermore, we introduce a discount factor \(\gamma \in (0,1)\), which will later be used to define the value function and ensure convergence of long-term rewards. A Markov Decision Process is then defined by the tuple

\[ (\cS, \cA, \P, r, \gamma). \]
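For concreteness, the components of this tuple can be collected into a single object. The sketch below is a hypothetical two-state, two-action example used only to illustrate the structure; it is not the MDP analyzed later in these notes.

  # a hypothetical two-state, two-action MDP stored as an R list
  # P[[a]][s, s'] is the probability of moving from state s to s' under action a
  mdp <- list(
    states  = 1:2,
    actions = 1:2,
    P = list(matrix(c(0.7, 0.3,
                      0.4, 0.6), byrow = TRUE, nrow = 2),  # action 1
             matrix(c(0.5, 0.5,
                      0.2, 0.8), byrow = TRUE, nrow = 2)), # action 2
    r = matrix(c(1, 0,
                 0, 2), byrow = TRUE, nrow = 2),           # r[s, a]
    gamma = 0.9
  )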

In practice, we observe \(n\) (possibly just one) i.i.d. trajectories \(\{ S_t, A_t, R_t \}_{t=0}^T\) generated from the MDP.

Figure 1: A graph representation of the Markov Decision Process

3 Discounted Sum of Rewards

The goal of a reinforcement learning problem is either to evaluate a given policy \(\pi\) or to find the best policy \(\pi^\ast\) that maximizes1 the expected long-term return. In the MDP framework, we will use the discounted sum of rewards as an objective to evaluate any policy. This is defined using the value function:

\[ V^\pi(s) = \E_\pi \left[ \sum_{t=0}^\infty \gamma^t R_t \mid S_0 = s \right], \]

where \(V^\pi(s)\) is the value function with \(s\) as the starting point, and the trajectory \(\{ S_t, A_t, R_t \}_{t=0}^\infty\) is generated under policy \(\pi\). The expectation is taken over the randomness in both the state transition kernel \(\P(S_{t+1} \mid S_t, A_t)\) and (if applicable) the stochastic policy \(\pi(A_t \mid S_t)\). A related and equally important concept is the action-value function, or Q-function, defined as

\[ Q^\pi(s, a) = \E_\pi \left[ \sum_{t=0}^\infty \gamma^t R_t \mid S_0 = s, A_0 = a \right]. \]

This quantity measures the expected cumulative reward if the agent starts from state \(s\), takes action \(a\) at the first step, and then continues to follow policy \(\pi\) afterward. The Q-function therefore captures the quality of a particular action in a given state.

Instead of averaging future rewards, as is done in the average-reward formulation (Wei et al. 2020; Liao et al. 2022), we use a discount factor \(\gamma \in (0,1)\), which ensures that future rewards contribute less to the total return and that the infinite series above converges. When the rewards are bounded by \(|R_t| \le R_{\max}\), the discounted sum is guaranteed to be finite:

\[ \Big| \sum_{t=0}^\infty \gamma^t R_t \Big| \le \sum_{t=0}^\infty \gamma^t R_{\max} = \frac{R_{\max}}{1 - \gamma}. \]

The discounted formulation also allows us to develop recursive relationships between \(V^\pi\) and \(Q^\pi\), which will lead to the Bellman equations introduced later. Conceptually, a smaller \(\gamma\) places more emphasis on immediate rewards, while a larger \(\gamma\) values long-term outcomes more heavily. This weighting is often desirable in real-world problems where short-term benefits typically carry greater importance than distant future gains.
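As a quick numerical sanity check of this bound, here is a minimal sketch; the values \(\gamma = 0.7\) and \(R_{\max} = 4\) are arbitrary illustrative choices.

  # illustrative check of the geometric bound on the discounted return
  gamma <- 0.7
  R_max <- 4
  R_max / (1 - gamma)          # closed-form bound R_max / (1 - gamma)
  sum(gamma^(0:200) * R_max)   # truncated discounted sum; essentially equals the bound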

4 The Bellman Equation

Due to the Markov property and stationarity, the value and Q-functions can be defined starting from any stage \(t\), and their forms will be identical to those starting from \(t = 0\). That is,

\[ V_t^\pi(s) = \E_\pi \left[ \sum_{j=0}^\infty \gamma^{j} R_{t + j} \mid S_t = s \right], \qquad Q_t^\pi(s, a) = \E_\pi \left[ \sum_{j=0}^\infty \gamma^{j} R_{t + j} \mid S_t = s, A_t = a \right]. \]

The subscript \(t\) is often omitted when the process is stationary, since these functions do not depend explicitly on time.

A central concept in reinforcement learning is the Bellman equation (Sutton and Barto 2018), which expresses a recursive relationship between the value (or Q-) function at any time \(t\) and the value function at the next step. Consider a deterministic policy \(A_t = \pi(S_t)\) (the stochastic case is similar, integrating over the policy distribution). Starting from the definition of the Q-function, we have

\[ \begin{aligned} Q^\pi(s, a) &= \E_\pi \left[ R_t + \gamma \sum_{j=1}^\infty \gamma^{j-1} R_{t+j} \mid S_t = s, A_t = a \right] \\ &= r(s, a) + \gamma \E_{S_{t+1} \sim \P(\cdot \mid s, a)} \left[ \E_\pi \Big[ \sum_{j=0}^\infty \gamma^{j} R_{t+1+j} \mid S_{t+1} \Big] \right] \\ &= r(s, a) + \gamma \E_{S_{t+1} \sim \P(\cdot \mid s, a)} \big[ V^\pi(S_{t+1}) \big]. \end{aligned} \]

This shows that the Q-function equals the immediate reward plus the discounted expected value of the next state. The corresponding recursive equation for the value function is obtained by averaging over actions under the policy \(\pi\):

\[ V^\pi(s) = \E_{A \sim \pi(\cdot \mid s)} \big[ Q^\pi(s, A) \big]. \]

Combining the two relations (substitute the second one into the first) gives the Bellman equation for the Q-function:

\[ Q^\pi(s, a) = r(s, a) + \gamma \E_{S' \sim \P(\cdot \mid s, a),\, A' \sim \pi(\cdot \mid S')} \big[ Q^\pi(S', A') \big]. \tag{1} \]

Here \((S', A')\) denote the (random) next state-action pair generated by the environment and the policy. This recursive structure forms the foundation for most reinforcement learning algorithms, including dynamic programming, temporal-difference learning, and Q-learning. If we instead integrate the entire expectation over the distribution of the action \(A\) at the current state, we obtain the Bellman expectation equation for the value function:

\[ V^\pi(s) = \E_{A \sim \pi(\cdot \mid s)} \big[ r(s, A) + \gamma \E_{ S' \sim \P(\cdot \mid s, A) } [ V^\pi(S') ] \big]. \tag{2} \]

5 Bellman Equation in the Tabular Case

To better illustrate how the Bellman equation operates, consider a finite discrete state space, that is, \(S_t \in \cS\) with \(|\cS| < \infty\). Also, we will consider a finite discrete action space \(A_t \in \cA\) with \(|\cA| < \infty\). In this case, both the transition dynamics and the value function can be expressed in matrix form and we will be able to solve for the value function exactly using linear algebra. To see this, let \(\bP^\pi\) be the \(|\cS| \times |\cS|\) transition matrix that represents the probability of moving from one state to another under policy \(\pi\). Each element \(P^\pi_{s, s'}\) gives the probability of transitioning from state \(s\) to state \(s'\) in one step when the agent follows policy \(\pi\):

\[ \begin{aligned} P^\pi_{s, s'} &= \P(S_{t+1} = s' \mid S_t = s, A_t \sim \pi(\cdot \mid s)) \\ &= \E_{A_t \sim \pi(\cdot \mid s)} \big[ \P(S_{t+1} = s' \mid S_t = s, A_t) \big] \\ &= \sum_{a \in \cA} \P(S_{t+1} = s' \mid S_t = s, A_t = a)\, \pi(a \mid s) \end{aligned} \]

where each row of \(\bP^\pi\) is a probability distribution, satisfying \(\sum_{s'} P^\pi_{s, s'} = 1\) for every \(s \in \cS\). Starting from the Bellman expectation equation defined in (2), we can expand the expectations explicitly: \[ V^\pi(s) = \sum_{a \in \cA} \pi(a \mid s) \Big[ r(s, a) + \gamma \sum_{s' \in \cS} \P(s' \mid s, a) V^\pi(s') \Big]. \] Define the expected reward under the policy at each state by \[ r^\pi(s) = \sum_{a \in \cA} \pi(a \mid s)\, r(s, a), \] and recall that the transition probability under the policy is given by \[ P^\pi_{s, s'} = \sum_{a \in \cA} \pi(a \mid s)\, \P(s' \mid s, a). \] Substituting these definitions into the previous equation yields the simplified scalar Bellman equation \[ V^\pi(s) = r^\pi(s) + \gamma \sum_{s' \in \cS} P^\pi_{s, s'}\, V^\pi(s'). \]

Because there are finitely many states, we can stack all value functions into a vector \(\bv^\pi = (V^\pi(s_1), \ldots, V^\pi(s_{|\cS|}))^\top\), and similarly define the reward vector \(\br^\pi = (r^\pi(s_1), \ldots, r^\pi(s_{|\cS|}))^\top\). Then the Bellman equation for all states can be written compactly as \[ \bv^\pi = \br^\pi + \gamma \bP^\pi \bv^\pi. \tag{3} \] This is a system of \(|\cS|\) linear equations with \(|\cS|\) unknowns. Rearranging terms gives \[ (I - \gamma \bP^\pi)\, \bv^\pi = \br^\pi, \] and when \(I - \gamma \bP^\pi\) is invertible2, the unique solution is \[ \bv^\pi = (I - \gamma \bP^\pi)^{-1} \br^\pi. \]

This matrix formulation provides a compact and useful representation of the Bellman equation in the tabular case. When both \(\bP^\pi\) and \(r^\pi\) are known, one can compute the exact value function through direct matrix inversion. This is the task of policy evaluation. In practice, however, \(\bP^\pi\) and \(r^\pi\) are usually unknown and must be estimated from data, motivating algorithmic approaches such as Monte Carlo evaluation, temporal-difference learning, and Q-learning.

6 Example: Policy Evaluation with True Model Generator

Let’s consider a simple MDP with three states and two treatment actions. The transition probability matrix associated with \(\P(S_{t+1} = s' \mid S_t = s, A_t = 1 )\) is

\[ P_1 = \begin{pmatrix} 0.9 & 0.075 & 0.025 \\ 0.15 & 0.8 & 0.05 \\ 0.25 & 0.25 & 0.5 \end{pmatrix} \]

The transition probability matrix associated with \(\P(S_{t+1} = s' | S_t = s, A_t = 2 )\) is

\[ P_2 = \begin{pmatrix} 0.8 & 0.1 & 0.1 \\ 0.2 & 0.6 & 0.2 \\ 0.3 & 0.3 & 0.4 \end{pmatrix} \]

We assume a stochastic behavior policy with two possible actions, where each row below gives the probabilities of taking actions 1 and 2 in the corresponding state:

\[ \pi = \begin{pmatrix} 1/2 & 1/2 \\ 2/3 & 1/3 \\ 1/3 & 2/3 \end{pmatrix} \]

Then the expected transition matrix is

\[ P = \begin{pmatrix} 0.85 & 0.0875 & 0.0625 \\ 0.1667 & 0.7333 & 0.1 \\ 0.2833 & 0.2833 & 0.4333 \end{pmatrix} \]

We further define the reward function \(r(s, a)\) for each state-action pair as

\[ r = \begin{pmatrix} 1 & 3 \\ 2 & 4 \\ 3 & 1 \end{pmatrix} \]

Using Equation (3), we can solve for the value function.

  # Define the transition probability matrices P1 and P2 given above
  P1 <- matrix(c(0.9, 0.075, 0.025,
                 0.15, 0.8, 0.05,
                 0.25, 0.25, 0.5), byrow=TRUE, nrow=3)

  P2 <- matrix(c(0.8, 0.1, 0.1,
                 0.2, 0.6, 0.2,
                 0.3, 0.3, 0.4), byrow=TRUE, nrow=3)

  # Define a stochastic policy
  # The first row means that we will take action 1 with probability 1/2, action 2 with probability 1/2
  policy <- matrix(c(1/2, 1/2,
                     2/3, 1/3,
                     1/3, 2/3), byrow=TRUE, nrow=3)

  # Define the expected transition matrix
  P = sweep(P1, 1, policy[,1], "*") + sweep(P2, 1, policy[,2], "*")

  # define the mean reward function for each state and action pair
  r = matrix(c(1, 3,
               2, 4,
               3, 1), byrow=TRUE, nrow=3)

  # define the expected reward for each state
  r_expected = rowSums(r * policy)
  
  # define the discount factor
  gamma = 0.7
  
  # Solve for the value function using matrix inversion
  I = diag(3)
  v = solve(I - gamma * P) %*% r_expected
  v
##          [,1]
## [1,] 6.879767
## [2,] 8.085625
## [3,] 6.652827
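As a cross-check, the same value function can be obtained without matrix inversion by repeatedly applying the Bellman update in Equation (3) until it reaches a fixed point. The sketch below reuses P, r_expected, and gamma from the code above; the tolerance and iteration cap are arbitrary choices.

  # iterative policy evaluation: repeatedly apply v <- r + gamma * P %*% v
  v_iter = rep(0, 3)
  
  for (iter in 1:1000){
    v_new = r_expected + gamma * P %*% v_iter
    if (max(abs(v_new - v_iter)) < 1e-10) break
    v_iter = v_new
  }
  
  # the result should match the matrix-inversion solution above
  v_iter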

7 Monte Carlo Approximation of the Policy Value

It is not obvious whether this result is correct, but we can validate it with simulated data. For demonstration, let’s simulate the discounted cumulative reward starting from state 2 and average over many trajectories. This is called a Monte Carlo method.

  # write a function to simulate the MDP
  simulate_MDP <- function(n, policy, P1, P2, r, gamma, s0){
    # prepare the data
    S = rep(NA, n)
    A = rep(NA, n)
    R = rep(NA, n)
    
    # start from the given initial state
    S[1] = s0
    
    # sequential simulation
    for (t in 1:n){
      
      # take an action based on the policy A_t
      A[t] = sample(1:2, 1, prob = policy[S[t],])
      
      # get the reward r(S_t, A_t) with a bit of noise
      R[t] = r[S[t], A[t]] + rnorm(1)*0.1
      
      if (t < n){
        # get the next state depending on the action
        if (A[t] == 1){
          S[t+1] = sample(1:3, 1, prob = P1[S[t],])
        } else {
          S[t+1] = sample(1:3, 1, prob = P2[S[t],])
        }
      }
    }
  
    #record the cumulative discounted reward
    return(list(S = S, A = A, R = R, cum_reward = sum(gamma^seq(0, n-1) %*% R)))
  }

  # Simulate the data
  # generate 1000 simulated trajectories
  nsim = 1000
  cum_reward = rep(NA, nsim)
  set.seed(1)
    
  for (k in 1:nsim)
  {
    #record the cumulative discounted reward
    cum_reward[k] = simulate_MDP(60, policy, P1, P2, r, gamma, 2)$cum_reward
  }
  
  mean(cum_reward)
## [1] 8.043472

This is very close to the true policy value 8.0856248.
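The remaining gap is consistent with Monte Carlo noise. A quick way to quantify it (a small sketch reusing the cum_reward vector and nsim generated above) is to attach a standard error and a rough interval to the simulation average:

  # Monte Carlo standard error of the estimated policy value at state 2
  sd(cum_reward) / sqrt(nsim)
  
  # a rough 95% interval around the Monte Carlo average
  mean(cum_reward) + c(-2, 2) * sd(cum_reward) / sqrt(nsim)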

8 Example: Policy Evaluation with Observed Data

In practice, we do not directly observe the transition matrices and the expected rewards. Instead, we have a sequence of observed data \(\{ S_t, A_t, R_t \}_{t=0}^T\). We can use these data to estimate the transition matrices and the expected rewards and plug them into the Bellman equation. Let’s look at an example in which we first observe a sequence of data generated under the behavior policy defined previously, and then estimate the value of a new policy

\[ \pi_\text{new} = \begin{pmatrix} 1/4 & 3/4 \\ 1/2 & 1/2 \\ 1/3 & 2/3 \end{pmatrix} \]

First, we estimate \(P_1\), \(P_2\) and \(r\) based on the observed data.

  # Simulate a sequence of 5000 observations
  # It doesn't really matter which initial state we start from
  mydata = simulate_MDP(5000, policy, P1, P2, r, gamma, 2)
  S = mydata$S
  A = mydata$A
  R = mydata$R

  # Estimate the transition matrix
  P1_hat = matrix(0, nrow=3, ncol=3)
  P2_hat = matrix(0, nrow=3, ncol=3)
  
  # estimate the mean reward for each state and action pair
  r_hat = matrix(0, nrow=3, ncol=2)
  r_count = matrix(0, nrow=3, ncol=2)
  
  for (t in 1:(length(S)-1)){
    
    # for each time point, we count the transition
    if (A[t] == 1){
      
      P1_hat[S[t], S[t+1]] = P1_hat[S[t], S[t+1]] + 1

    } else {
      
      P2_hat[S[t], S[t+1]] = P2_hat[S[t], S[t+1]] + 1
      
    }
    
    # add and count the reward
    r_hat[S[t], A[t]] = r_hat[S[t], A[t]] + R[t]
    r_count[S[t], A[t]] = r_count[S[t], A[t]] + 1
    
  }
  
  # calculate the observed transition matrix
  P1_hat = P1_hat / rowSums(P1_hat)
  P2_hat = P2_hat / rowSums(P2_hat)
  
  # calculate the mean reward for each state and action pair
  r_hat = r_hat / r_count
  
  # Print the estimated quantities
  P1_hat
##           [,1]       [,2]       [,3]
## [1,] 0.9101754 0.06877193 0.02105263
## [2,] 0.1441006 0.80464217 0.05125725
## [3,] 0.2197309 0.30493274 0.47533632
  P2_hat
##           [,1]       [,2]      [,3]
## [1,] 0.7895100 0.09592823 0.1145618
## [2,] 0.2374245 0.56539235 0.1971831
## [3,] 0.3180593 0.30188679 0.3800539
  r_hat
##           [,1]     [,2]
## [1,] 0.9994502 3.000719
## [2,] 2.0002712 3.998590
## [3,] 2.9867864 1.001137

Then we define the new policy and use the Bellman equation (3) to estimate its value function:

  # Define the new policy
  policy_new = matrix(c(1/4, 3/4,
                        1/2, 1/2,
                        1/3, 2/3), byrow=TRUE, nrow=3)

  # calculate the expected transition matrix and reward
  r_expected_hat = rowSums(r_hat * policy_new)
  P_hat = sweep(P1_hat, 1, policy_new[,1], "*") + sweep(P2_hat, 1, policy_new[,2], "*")
  
  # Solve for the value function using matrix inversion
  v_hat = solve(I - gamma * P_hat) %*% r_expected_hat
  v_hat
##          [,1]
## [1,] 8.309056
## [2,] 9.128629
## [3,] 7.387469

Let’s validate this using newly simulated data generated from the true model under the new policy, starting from state 3:

  # Simulate the data
  # generate 1000 simulated trajectories starting from state 3
  nsim = 1000
  cum_reward = rep(NA, nsim)
  set.seed(1)
    
  for (k in 1:nsim)
  {
    #record the cumulative discounted reward
    cum_reward[k] = simulate_MDP(60, policy_new, P1, P2, r, gamma, 3)$cum_reward
  }
  
  mean(cum_reward)
## [1] 7.278483

We can see that our estimated policy value at state 3 is 7.3874687, while the Monte Carlo value computed under the true model is 7.2784831. The two are reasonably close, with the difference attributable to estimation error in the transition matrices and rewards, and to Monte Carlo noise.
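The same comparison can be repeated for every starting state to see how well the plug-in estimate tracks the value of the new policy overall. This is a small sketch that reuses simulate_MDP, policy_new, and the true model (P1, P2, r, gamma) defined earlier; the seed and number of replications are arbitrary.

  # Monte Carlo value of the new policy from each starting state
  v_mc = rep(NA, 3)
  set.seed(2)
  
  for (s0 in 1:3){
    rewards = replicate(1000, simulate_MDP(60, policy_new, P1, P2, r, gamma, s0)$cum_reward)
    v_mc[s0] = mean(rewards)
  }
  
  # compare the plug-in estimates with the Monte Carlo values
  data.frame(estimated = as.vector(v_hat), monte_carlo = v_mc)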


Liao, Peng, Zhengling Qi, Runzhe Wan, Predrag Klasnja, and Susan A Murphy. 2022. “Batch Policy Learning in Average Reward Markov Decision Processes.” Annals of Statistics 50 (6): 3364.
Puterman, Martin L. 1994. “Markov Decision Processes.” Wiley Series in Probability and Statistics.
Sutton, Richard S, and Andrew G Barto. 2018. Reinforcement Learning: An Introduction. MIT press.
Wei, Chen-Yu, Mehdi Jafarnia Jahromi, Haipeng Luo, Hiteshi Sharma, and Rahul Jain. 2020. “Model-Free Reinforcement Learning in Infinite-Horizon Average-Reward Markov Decision Processes.” In International Conference on Machine Learning, 10170–80. PMLR.

  1. In this setting, the policy we are interested in will also be stationary, i.e., \(\pi_t(\cdot) = \pi(\cdot)\) for all \(t\). Under the Markov and stationarity assumptions, the current state \(S_t\) contains all relevant information needed for future decision making. Consequently, the optimal policy (a common goal in reinforcement learning) does not need to depend on the entire history but only on the current state. In fact, for the discounted infinite-horizon problem, it can be shown that there always exists an optimal deterministic stationary policy \(\pi^\ast(S_t)\) that maximizes the long-term expected return. This result is fundamental in dynamic programming and reinforcement learning. These points will become clearer after we introduce the related concepts.↩︎

  2. To see why \(I - \gamma P^\pi\) is invertible, note that every row of \(P^\pi\) sums to one (it is row-stochastic), so \(P^\pi\) is non-expansive in \(\|\cdot\|_\infty\): \(\|P^\pi x\|_\infty \le \|x\|_\infty\) for all \(x\). Now suppose \((I-\gamma P^\pi)x=0\) for some \(x\); then \(x=\gamma P^\pi x\). Taking \(\|\cdot\|_\infty\) on both sides gives \[ \|x\|_\infty \;=\; \gamma \|P^\pi x\|_\infty \;\le\; \gamma \|x\|_\infty, \] so \((1-\gamma)\|x\|_\infty \le 0\). With \(0<\gamma<1\), this forces \(\|x\|_\infty=0\), hence \(x=0\). The null space of \(I-\gamma P^\pi\) is therefore trivial, and the matrix is invertible.↩︎