\(\newcommand{\ci}{\perp\!\!\!\perp}\) \(\newcommand{\cA}{\mathcal{A}}\) \(\newcommand{\cB}{\mathcal{B}}\) \(\newcommand{\cC}{\mathcal{C}}\) \(\newcommand{\cD}{\mathcal{D}}\) \(\newcommand{\cE}{\mathcal{E}}\) \(\newcommand{\cF}{\mathcal{F}}\) \(\newcommand{\cG}{\mathcal{G}}\) \(\newcommand{\cH}{\mathcal{H}}\) \(\newcommand{\cI}{\mathcal{I}}\) \(\newcommand{\cJ}{\mathcal{J}}\) \(\newcommand{\cK}{\mathcal{K}}\) \(\newcommand{\cL}{\mathcal{L}}\) \(\newcommand{\cM}{\mathcal{M}}\) \(\newcommand{\cN}{\mathcal{N}}\) \(\newcommand{\cO}{\mathcal{O}}\) \(\newcommand{\cP}{\mathcal{P}}\) \(\newcommand{\cQ}{\mathcal{Q}}\) \(\newcommand{\cR}{\mathcal{R}}\) \(\newcommand{\cS}{\mathcal{S}}\) \(\newcommand{\cT}{\mathcal{T}}\) \(\newcommand{\cU}{\mathcal{U}}\) \(\newcommand{\cV}{\mathcal{V}}\) \(\newcommand{\cW}{\mathcal{W}}\) \(\newcommand{\cX}{\mathcal{X}}\) \(\newcommand{\cY}{\mathcal{Y}}\) \(\newcommand{\cZ}{\mathcal{Z}}\) \(\newcommand{\bA}{\mathbf{A}}\) \(\newcommand{\bB}{\mathbf{B}}\) \(\newcommand{\bC}{\mathbf{C}}\) \(\newcommand{\bD}{\mathbf{D}}\) \(\newcommand{\bE}{\mathbf{E}}\) \(\newcommand{\bF}{\mathbf{F}}\) \(\newcommand{\bG}{\mathbf{G}}\) \(\newcommand{\bH}{\mathbf{H}}\) \(\newcommand{\bI}{\mathbf{I}}\) \(\newcommand{\bJ}{\mathbf{J}}\) \(\newcommand{\bK}{\mathbf{K}}\) \(\newcommand{\bL}{\mathbf{L}}\) \(\newcommand{\bM}{\mathbf{M}}\) \(\newcommand{\bN}{\mathbf{N}}\) \(\newcommand{\bO}{\mathbf{O}}\) \(\newcommand{\bP}{\mathbf{P}}\) \(\newcommand{\bQ}{\mathbf{Q}}\) \(\newcommand{\bR}{\mathbf{R}}\) \(\newcommand{\bS}{\mathbf{S}}\) \(\newcommand{\bT}{\mathbf{T}}\) \(\newcommand{\bU}{\mathbf{U}}\) \(\newcommand{\bV}{\mathbf{V}}\) \(\newcommand{\bW}{\mathbf{W}}\) \(\newcommand{\bX}{\mathbf{X}}\) \(\newcommand{\bY}{\mathbf{Y}}\) \(\newcommand{\bZ}{\mathbf{Z}}\) \(\newcommand{\ba}{\mathbf{a}}\) \(\newcommand{\bb}{\mathbf{b}}\) \(\newcommand{\bc}{\mathbf{c}}\) \(\newcommand{\bd}{\mathbf{d}}\) \(\newcommand{\be}{\mathbf{e}}\) \(\newcommand{\bg}{\mathbf{g}}\) \(\newcommand{\bh}{\mathbf{h}}\) \(\newcommand{\bi}{\mathbf{i}}\) \(\newcommand{\bj}{\mathbf{j}}\) \(\newcommand{\bk}{\mathbf{k}}\) \(\newcommand{\bl}{\mathbf{l}}\) \(\newcommand{\bm}{\mathbf{m}}\) \(\newcommand{\bn}{\mathbf{n}}\) \(\newcommand{\bo}{\mathbf{o}}\) \(\newcommand{\bp}{\mathbf{p}}\) \(\newcommand{\bq}{\mathbf{q}}\) \(\newcommand{\br}{\mathbf{r}}\) \(\newcommand{\bs}{\mathbf{s}}\) \(\newcommand{\bt}{\mathbf{t}}\) \(\newcommand{\bu}{\mathbf{u}}\) \(\newcommand{\bv}{\mathbf{v}}\) \(\newcommand{\bw}{\mathbf{w}}\) \(\newcommand{\bx}{\mathbf{x}}\) \(\newcommand{\by}{\mathbf{y}}\) \(\newcommand{\bz}{\mathbf{z}}\) \(\newcommand{\balpha}{\boldsymbol{\alpha}}\) \(\newcommand{\bbeta}{\boldsymbol{\beta}}\) \(\newcommand{\btheta}{\boldsymbol{\theta}}\) \(\newcommand{\hpi}{\widehat{\pi}}\) \(\newcommand{\bpi}{\boldsymbol{\pi}}\) \(\newcommand{\hbpi}{\widehat{\boldsymbol{\pi}}}\) \(\newcommand{\bxi}{\boldsymbol{\xi}}\) \(\newcommand{\bmu}{\boldsymbol{\mu}}\) \(\newcommand{\bepsilon}{\boldsymbol{\epsilon}}\) \(\newcommand{\bzero}{\mathbf{0}}\) \(\newcommand{\T}{\text{T}}\) \(\newcommand{\Trace}{\text{Trace}}\) \(\newcommand{\Cov}{\text{Cov}}\) \(\newcommand{\Var}{\text{Var}}\) \(\newcommand{\E}{\mathbb{E}}\) \(\newcommand{\Pr}{\text{Pr}}\) \(\newcommand{\pr}{\text{pr}}\) \(\newcommand{\pdf}{\text{pdf}}\) \(\newcommand{\P}{\text{P}}\) \(\newcommand{\p}{\text{p}}\) \(\newcommand{\One}{\mathbf{1}}\) \(\newcommand{\argmin}{\operatorname*{arg\,min}}\) \(\newcommand{\argmax}{\operatorname*{arg\,max}}\)

Overview

Many of our previous examples assumed that we know the transition probability and the reward function. In practice, these can be estimated from a pre-collected dataset, and the algorithm can then be carried out with the estimated quantities. If we estimate the transition probability, the approach is called model-based reinforcement learning. However, in many cases we may not be able to estimate these quantities accurately. On the other hand, model-free reinforcement learning methods estimate the Q-function directly from the data. For example, once we have a new policy, we can roll out that policy to collect data and then estimate its Q-function (or value function). This is often done in an online setting, where we continuously update the Q-function as we collect more data. But one can also do this in an offline setting, where we (re-)estimate the value function of a policy based on a pre-collected dataset. Hence, many reinforcement learning algorithms can be viewed as operating in either the online or the offline setting. Model-based and model-free algorithms are often distinguished because estimating the transition probability may not be trivial. In this lecture, we focus on model-free reinforcement learning, and in particular Temporal Difference (TD) learning, which is an algorithm for policy evaluation. We then extend it to policy optimization, using ideas similar to policy iteration.

Temporal Difference Learning

Temporal Difference (TD) learning (Sutton 1988) is a model-free approach. The idea is to use the Bellman equation to update the value function or the Q-function using one step of the observed data. The fundamental difference between TD learning and our previously introduced value iteration is that TD learning is a sample-update method, meaning that we perform an update after collecting one more sample (the next time point \(t+1\)), while value iteration is an expected-update method. Based on the Bellman equation, we define the TD error:

\[ \delta_t = R_t + \gamma V(S_{t+1}) - V(S_t). \]

The TD error is the difference between the one-step bootstrapped target, \(R_t + \gamma V(S_{t+1})\), and the current estimate \(V(S_t)\). The TD learning algorithm then updates the value function based on the TD error:

\[ V(S_t) \leftarrow V(S_t) + \alpha \delta_t, \]

where \(\alpha\) is a learning rate.[1] Let’s first look at our previous example.

  # Define the transition probability matrix and rewards 
  P1 = matrix(c(0.8, 0.1, 0.1,
                0.05, 0.05, 0.9,
                0.8, 0.1, 0.1), byrow=TRUE, nrow=3)

  P2 = matrix(c(0.5, 0.25, 0.25,
                0.1, 0.8, 0.1,
                0.2, 0.2, 0.6), byrow=TRUE, nrow=3)

  # define the expected reward for each state and each action
  r = matrix(c(5, 3,
               1.6, 3,
               4, 2), byrow=TRUE, nrow=3)
  
  # define a function to simulate the next time point
  roll_t = function(St, At, P1, P2, r)
  {
    if (At == 1)
    {
      state = sample(1:3, 1, prob = P1[St,])
      reward = r[St, 1]
    } else {
      state = sample(1:3, 1, prob = P2[St,])
      reward = r[St, 2]
    }
    
    return(list("state" = state, 
                "reward" = reward))
  }
  
  # start with some arbitrary value function
  V = c(0, 0, 0)
  
  gamma = 0.7
  
  # use some arbitrary policy 
  mypolicy = matrix(c(1/2, 1/2,
                      1/3, 2/3,
                      3/4, 1/4),
                    byrow=TRUE, nrow=3)
  
  # we can compute theoretical value function
  P = sweep(P1, 1, mypolicy[,1], "*") + sweep(P2, 1, mypolicy[,2], "*")
  r_expected = rowSums(r * mypolicy)
  V_true = solve(diag(3) - gamma * P) %*% r_expected  
  
  # lets try to estimate the value function using TD learning
  
  # start with initial value function
  V = rep(8, 3)
  
  # number of iterations
  niter = 10000
  
  # some random initial state
  St = 1
  
  # learning rate
  alpha = 0.2
  
  # Record the estimation error
  error = rep(0, niter)

  # TD(0)
  set.seed(546)
  for (k in 1:niter)
  {
    # Generate action based on the policy
    At = sample(1:2, 1, prob = mypolicy[St,])
    
    # Based on the policy, simulate the next state
    Next_t = roll_t(St, At, P1, P2, r)
    
    # calculate the TD error
    delta = Next_t$reward + gamma * V[Next_t$state] - V[St]
    
    # update the value function
    # with a geometrically decaying learning rate
    V[St] = V[St] + alpha * 0.999^k * delta
    
    # update the state
    St = Next_t$state
    
    # record the error
    error[k] = sum((V - V_true)^2)
  }
  
  # TD learning estimation
  V
## [1] 12.29765 10.24672 11.78683
  
  # True value function
  as.vector(V_true)
## [1] 12.30744 10.23813 11.86436
  
  # plot the error
  plot(log(error), type="l", xlab="Iteration", ylab="Log-Error")

Again, the TD method can be viewed as an optimization problem, where we try to minimize the TD error. The TD error plays the role of a gradient for updating the value function, but the gradient observed at each iteration is not the true gradient; it is a noisy version of it (in some sense a stochastic gradient). Hence, TD learning is a stochastic optimization method. In this sense, the TD method also has a Monte Carlo flavor, since it simulates the next state and reward. But keep in mind that we only simulate one step of the data rather than waiting for the entire trajectory, which makes it much more efficient than the Monte Carlo method. This makes TD learning a very popular method for online learning. The TD method we used here is called TD(\(0\)), and we will discuss what this means in the discussion section.
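As noted in footnote [1], instead of the geometric decay \(\alpha \cdot 0.999^k\) used above, one could use a learning rate schedule satisfying \(\sum_k \alpha_k = \infty\) and \(\sum_k \alpha_k^2 < \infty\). Here is a minimal sketch of such a polynomially decaying schedule; the name alpha_k, the constant 0.5, and the exponent 0.7 are arbitrary choices for illustration.

  # a polynomially decaying learning rate schedule:
  # sum_k alpha_k diverges (exponent <= 1) while
  # sum_k alpha_k^2 converges (2 * exponent > 1)
  alpha_k = function(k) 0.5 / (1 + k)^0.7
  
  # inside the TD(0) loop, the update would then become
  # V[St] = V[St] + alpha_k(k) * delta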

On-policy TD Control: SARSA

TD learning is a policy evaluation method: it estimates the value function of a given policy. But we can also use the TD idea to optimize the policy. One of the most popular methods is SARSA (State-Action-Reward-State-Action). This is an on-policy method, meaning that it progressively updates its policy and collects new data using that updated policy. The idea is to update the Q-function with a rule similar to TD learning, but now involving the action as well. The TD error is defined as:

\[ \delta_t = R_t + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t) \]

and the update rule is then:

\[ Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \delta_t \]

We can then update the policy based on the Q-function, for example by choosing the action that maximizes the Q-function. However, this may cause trouble for coverage of the state space, meaning that we could get stuck in a policy that does not explore some states that could potentially have a higher value. In fact, this is related to the positivity assumption in causal inference and personalized medicine (see our lecture on observational studies and outcome weighted learning): we need some probability of choosing each treatment for each state, i.e., \(P(A | S) > 0\) for all \(S\). Hence, in online reinforcement learning algorithms, we often use an epsilon-greedy policy: with probability \(1-\epsilon\), we choose the action that maximizes the Q-function, and with probability \(\epsilon\), we choose a random action. This ensures that the policy keeps exploring the state space.
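To make the epsilon-greedy rule explicit, here is a minimal helper function as a sketch; the SARSA loop below implements the same logic inline, and the name epsilon_greedy is only used for illustration.

  # epsilon-greedy action selection from a Q-function
  # with probability epsilon: pick a uniformly random action
  # otherwise: pick the action maximizing Q in the current state
  epsilon_greedy = function(Q, state, epsilon)
  {
    if (runif(1) < epsilon)
      sample(1:ncol(Q), 1)
    else
      which.max(Q[state, ])
  }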

  # start with some random initial Q-function
  Q = matrix(10+runif(6), nrow=3, ncol=2)
  
  # some random initial state and action
  St = 1
  At = sample(1:2, 1)
  
  # epsilon for epsilon-greedy policy
  epsilon = 0.1
  
  # SARSA
  for (k in 1:niter)
  {
    # Based on the policy, simulate the next state
    Do_next = roll_t(St, At, P1, P2, r)
    
    # choose the next action based on the epsilon-greedy policy
    if (runif(1) < epsilon)
    {
      Do_next$'action' = sample(1:2, 1) # random sample
    } else {
      Do_next$'action' = which.max(Q[Do_next$state, ])
    }

    # calculate the TD error
    delta = Do_next$reward + gamma * Q[Do_next$state, Do_next$action] - Q[St, At]
    
    # update the Q-function
    Q[St, At] = Q[St, At] + alpha * 0.999^k * delta
    
    # update the state and action for next iteration
    St = Do_next$state
    At = Do_next$action
  }
  
  Q
##          [,1]     [,2]
## [1,] 15.35734 12.80108
## [2,] 11.44649 11.23804
## [3,] 14.30031 11.09832

This still produces the optimal policy (action 1 for all states). However, the estimates can be slightly off from the true Q-function because we are not using the target policy to collect data: the epsilon-greedy policy still selects a random action with probability \(\epsilon\). If we have too much exploration, the estimates may be biased and the procedure may not converge to the optimal policy. The following code uses a 50% chance of exploring, which leads to a significant bias and can even give a wrong optimal policy.[2]

  # start with some random initial Q-function
  Q = matrix(10+runif(6), nrow=3, ncol=2)
  
  # some random initial state and action
  St = 1
  At = sample(1:2, 1)
  
  # epsilon for epsilon-greedy policy
  epsilon = 0.5
  
  # SARSA
  for (k in 1:niter)
  {
    # Based on the policy, simulate the next state
    Do_next = roll_t(St, At, P1, P2, r)
    
    # choose the next action based on the epsilon-greedy policy
    if (runif(1) < epsilon)
    {
      Do_next$'action' = sample(1:2, 1) # random sample
    } else {
      Do_next$'action' = which.max(Q[Do_next$state, ])
    }

    # calculate the TD error
    delta = Do_next$reward + gamma * Q[Do_next$state, Do_next$action] - Q[St, At]
    
    # update the Q-function
    Q[St, At] = Q[St, At] + alpha * 0.999^k * delta
    
    # update the state and action for next iteration
    St = Do_next$state
    At = Do_next$action
  }
  
  Q
##          [,1]     [,2]
## [1,] 14.29641 11.87361
## [2,] 10.35492 10.81260
## [3,] 13.28149 10.61535

This is a bit off from our previously estimated Q-function, mainly because of the epsilon-greedy policy, which now has a large chance of taking a random action. This highlights the trade-off between exploration and exploitation, and the fact that SARSA is an on-policy method: it evaluates the policy that actually generates the data. When that behavior policy is far from the optimal policy, SARSA does not recover the Q-function of the optimal policy.

Off-policy TD Control: Q-learning

In some cases, we may not be able to collect data using the policy that we want to evaluate. Can we still infer the optimal policy accurately in that case? Q-learning (Watkins 1989) is a method for doing so. Compared with SARSA, Q-learning is an off-policy method, meaning that it can estimate the Q-function of one policy (here, the optimal policy) based on data collected from another policy. The idea is to update the Q-function using the maximum Q value of the next state, regardless of the action we actually take next. The TD error is defined as:

\[ \delta_t = R_t + \gamma \max_a Q(S_{t+1}, a) - Q(S_t, A_t) \]

and the update rule is then:

\[ Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \delta_t \]

Let’s look at an example. In this case, let’s use a large \(\epsilon = 0.5\) for the epsilon-greedy policy. We know that SARSA suffers in this setting since it is an on-policy method, but Q-learning is still able to learn the optimal policy even though the data are collected by a fairly random policy. The resulting Q-function is almost identical to our results in the contraction mapping lecture.

  # start with some random initial Q-function
  Q = matrix(runif(6), nrow=3, ncol=2)

  # number of iterations
  niter = 10000
  
  # some random initial state and action
  St = 1
  At = sample(1:2, 1)
  
  # epsilon for epsilon-greedy policy
  epsilon = 0.5
  
  # Q-learning
  for (k in 1:niter)
  {
    # Based on the policy, simulate the next state
    Do_next = roll_t(St, At, P1, P2, r)
    
    # choose the next action based on the epsilon-greedy policy
    if (runif(1) < epsilon)
    {
      Do_next$'action' = sample(1:2, 1) # random sample
    } else {
      Do_next$'action' = which.max(Q[Do_next$state, ])
    }

    # calculate the TD error
    delta = Do_next$reward + gamma * max(Q[Do_next$state, ]) - Q[St, At]
    
    # update the Q-function
    Q[St, At] = Q[St, At] + alpha * 0.999^k * delta
    
    # update the state and action for next iteration
    St = Do_next$state
    At = Do_next$action
  }
  
  Q
##          [,1]     [,2]
## [1,] 15.62324 13.07501
## [2,] 11.75856 11.68566
## [3,] 14.58785 11.97888
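To read off the policy implied by the estimated Q-function, we can pick, for each state, the action with the largest Q value; with the estimates above this again selects action 1 in every state.

  # greedy policy implied by the estimated Q-function
  apply(Q, 1, which.max)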

Discussion: \(n\)-step TD Learning

The TD learning we discussed so far is one-step TD learning, meaning that we only use one step of the data to update the value function: the TD error is defined using the one-step return. But this may not be very accurate. Based on our understanding of the Bellman equation, the value function is the expected return, which is the sum of the immediate reward and the discounted value of the next state. We can expand this Bellman equation further into the future:

\[ \begin{aligned} V(S_t) &= \E \left[ R_t + \gamma R_{t+1} + \gamma^2 R_{t+2} + \cdots \right] \\ &= \E \left[ R_t + \gamma V(S_{t+1}) \right] \\ &= \E \left[ R_t + \gamma R_{t+1} + \gamma^2 V(S_{t+2}) \right] \\ &= \E \left[ R_t + \gamma R_{t+1} + \cdots + \gamma^{n-1} R_{t+n-1} + \gamma^n V(S_{t+n}) \right] \end{aligned} \]

Hence, we can define the \(n\)-step TD error as

\[ \delta_t = R_t + \gamma R_{t+1} + \cdots + \gamma^{n-1} R_{t+n-1} + \gamma^n V(S_{t+n}) - V(S_t) \]

and the update rule is the same as in one-step TD learning. The \(n\)-step TD learning is a trade-off between one-step TD learning and the Monte Carlo method. One can also implement this idea for SARSA and Q-learning, as in the sketch below.
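Here is a sketch of how the \(n\)-step idea could be used for policy evaluation in our running example: at each iteration we roll the chain \(n\) steps ahead under the policy before forming the TD error. For simplicity this sketch re-simulates the look-ahead from the current state rather than maintaining a buffer of the last \(n\) transitions along a single trajectory, and the choice \(n = 3\) is arbitrary.

  # n-step TD policy evaluation (sketch), reusing roll_t(), mypolicy,
  # P1, P2, r, gamma, alpha and niter defined earlier
  n = 3
  V = rep(8, 3)
  St = 1
  
  set.seed(546)
  for (k in 1:niter)
  {
    # roll the chain n steps ahead under the policy,
    # accumulating discounted rewards along the way
    G = 0
    s = St
    for (j in 1:n)
    {
      a = sample(1:2, 1, prob = mypolicy[s, ])
      step = roll_t(s, a, P1, P2, r)
      G = G + gamma^(j-1) * step$reward
      if (j == 1) S_next = step$state # remember S_{t+1}
      s = step$state
    }
    
    # n-step TD error: discounted rewards plus bootstrapped tail value
    delta = G + gamma^n * V[s] - V[St]
    
    # same update rule as one-step TD
    V[St] = V[St] + alpha * 0.999^k * delta
    
    # advance the chain by one step
    St = S_next
  }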

References

Bottou, Léon. 2012. “Stochastic Gradient Descent Tricks.” In Neural Networks: Tricks of the Trade: Second Edition, 421–36. Springer.
Sutton, Richard S. 1988. “Learning to Predict by the Methods of Temporal Differences.” Machine Learning 3: 9–44.
Watkins, Christopher John Cornish Hellaby. 1989. “Learning from Delayed Rewards.” PhD thesis, King’s College, University of Cambridge.

  [1] One can either specify a very small learning rate, or use some common conditions such as \(\sum_{t=1}^\infty \alpha_t = \infty\) and \(\sum_{t=1}^\infty \alpha_t^2 < \infty\). We refer to Bottou (2012) for more details in practice.

  [2] In practice, we can use a decaying \(\epsilon\): start with a large \(\epsilon\) and gradually decrease it. This ensures that the policy explores the state space at the beginning and then gradually converges to the optimal policy.