\(\newcommand{\ci}{\perp\!\!\!\perp}\) \(\newcommand{\cA}{\mathcal{A}}\) \(\newcommand{\cB}{\mathcal{B}}\) \(\newcommand{\cC}{\mathcal{C}}\) \(\newcommand{\cD}{\mathcal{D}}\) \(\newcommand{\cE}{\mathcal{E}}\) \(\newcommand{\cF}{\mathcal{F}}\) \(\newcommand{\cG}{\mathcal{G}}\) \(\newcommand{\cH}{\mathcal{H}}\) \(\newcommand{\cI}{\mathcal{I}}\) \(\newcommand{\cJ}{\mathcal{J}}\) \(\newcommand{\cK}{\mathcal{K}}\) \(\newcommand{\cL}{\mathcal{L}}\) \(\newcommand{\cM}{\mathcal{M}}\) \(\newcommand{\cN}{\mathcal{N}}\) \(\newcommand{\cO}{\mathcal{O}}\) \(\newcommand{\cP}{\mathcal{P}}\) \(\newcommand{\cQ}{\mathcal{Q}}\) \(\newcommand{\cR}{\mathcal{R}}\) \(\newcommand{\cS}{\mathcal{S}}\) \(\newcommand{\cT}{\mathcal{T}}\) \(\newcommand{\cU}{\mathcal{U}}\) \(\newcommand{\cV}{\mathcal{V}}\) \(\newcommand{\cW}{\mathcal{W}}\) \(\newcommand{\cX}{\mathcal{X}}\) \(\newcommand{\cY}{\mathcal{Y}}\) \(\newcommand{\cZ}{\mathcal{Z}}\) \(\newcommand{\bA}{\mathbf{A}}\) \(\newcommand{\bB}{\mathbf{B}}\) \(\newcommand{\bC}{\mathbf{C}}\) \(\newcommand{\bD}{\mathbf{D}}\) \(\newcommand{\bE}{\mathbf{E}}\) \(\newcommand{\bF}{\mathbf{F}}\) \(\newcommand{\bG}{\mathbf{G}}\) \(\newcommand{\bH}{\mathbf{H}}\) \(\newcommand{\bI}{\mathbf{I}}\) \(\newcommand{\bJ}{\mathbf{J}}\) \(\newcommand{\bK}{\mathbf{K}}\) \(\newcommand{\bL}{\mathbf{L}}\) \(\newcommand{\bM}{\mathbf{M}}\) \(\newcommand{\bN}{\mathbf{N}}\) \(\newcommand{\bO}{\mathbf{O}}\) \(\newcommand{\bP}{\mathbf{P}}\) \(\newcommand{\bQ}{\mathbf{Q}}\) \(\newcommand{\bR}{\mathbf{R}}\) \(\newcommand{\bS}{\mathbf{S}}\) \(\newcommand{\bT}{\mathbf{T}}\) \(\newcommand{\bU}{\mathbf{U}}\) \(\newcommand{\bV}{\mathbf{V}}\) \(\newcommand{\bW}{\mathbf{W}}\) \(\newcommand{\bX}{\mathbf{X}}\) \(\newcommand{\bY}{\mathbf{Y}}\) \(\newcommand{\bZ}{\mathbf{Z}}\) \(\newcommand{\ba}{\mathbf{a}}\) \(\newcommand{\bb}{\mathbf{b}}\) \(\newcommand{\bc}{\mathbf{c}}\) \(\newcommand{\bd}{\mathbf{d}}\) \(\newcommand{\be}{\mathbf{e}}\) \(\newcommand{\bg}{\mathbf{g}}\) \(\newcommand{\bh}{\mathbf{h}}\) \(\newcommand{\bi}{\mathbf{i}}\) \(\newcommand{\bj}{\mathbf{j}}\) \(\newcommand{\bk}{\mathbf{k}}\) \(\newcommand{\bl}{\mathbf{l}}\) \(\newcommand{\bm}{\mathbf{m}}\) \(\newcommand{\bn}{\mathbf{n}}\) \(\newcommand{\bo}{\mathbf{o}}\) \(\newcommand{\bp}{\mathbf{p}}\) \(\newcommand{\bq}{\mathbf{q}}\) \(\newcommand{\br}{\mathbf{r}}\) \(\newcommand{\bs}{\mathbf{s}}\) \(\newcommand{\bt}{\mathbf{t}}\) \(\newcommand{\bu}{\mathbf{u}}\) \(\newcommand{\bv}{\mathbf{v}}\) \(\newcommand{\bw}{\mathbf{w}}\) \(\newcommand{\bx}{\mathbf{x}}\) \(\newcommand{\by}{\mathbf{y}}\) \(\newcommand{\bz}{\mathbf{z}}\) \(\newcommand{\balpha}{\boldsymbol{\alpha}}\) \(\newcommand{\bbeta}{\boldsymbol{\beta}}\) \(\newcommand{\btheta}{\boldsymbol{\theta}}\) \(\newcommand{\hpi}{\widehat{\pi}}\) \(\newcommand{\bpi}{\boldsymbol{\pi}}\) \(\newcommand{\hbpi}{\widehat{\boldsymbol{\pi}}}\) \(\newcommand{\bxi}{\boldsymbol{\xi}}\) \(\newcommand{\bmu}{\boldsymbol{\mu}}\) \(\newcommand{\bepsilon}{\boldsymbol{\epsilon}}\) \(\newcommand{\bzero}{\mathbf{0}}\) \(\newcommand{\T}{\text{T}}\) \(\newcommand{\Trace}{\text{Trace}}\) \(\newcommand{\Cov}{\text{Cov}}\) \(\newcommand{\Var}{\text{Var}}\) \(\newcommand{\E}{\mathbb{E}}\) \(\newcommand{\Pr}{\text{Pr}}\) \(\newcommand{\pr}{\text{pr}}\) \(\newcommand{\pdf}{\text{pdf}}\) \(\newcommand{\P}{\text{P}}\) \(\newcommand{\p}{\text{p}}\) \(\newcommand{\One}{\mathbf{1}}\) \(\newcommand{\argmin}{\operatorname*{arg\,min}}\) \(\newcommand{\argmax}{\operatorname*{arg\,max}}\)

Overview

In our previous examples, all states and actions were discrete. However, in many real-world applications, the state space and action space are continuous. In this lecture we discuss some general methods for handling continuous states and actions. We first introduce a method called fitted Q-iteration (Ernst, Geurts, and Wehenkel 2005) and use it as an example to illustrate the idea of functional embedding. This is an example of offline (batch) reinforcement learning, where we are given a fixed set of data and want to estimate the optimal policy from it.

Fitted Q-Iteration

Suppose we observed a set of offline data \(\{ (s_i, a_i, r_i, s_i') \}_{i=1}^n\)1, where \(s_i, s_i' \in \cS\), \(a_i \in \cA\), and \(r_i \in \mathbb{R}\). The goal is to estimate the optimal Q-function, which we know should satisfy the Bellman optimality equation:

\[ Q^\ast(s, a) = r(s, a) + \gamma \E_{s'} \big[ \max_{a'} Q^\ast(s', a') \big] \]

Suppose we have some estimate of the Q-function, say \(\hat Q\). Then a one-sample realization of the right-hand side of the Bellman optimality equation, conditioning on the “covariates” \((s, a)\), is

\[ r + \gamma \max_{a'} \hat Q(s', a') \]

where \(s'\) is the next state after taking action \(a\) at state \(s\). We can then use this one-sample realization as the outcome in a regression model and learn its conditional expectation, which is the Bellman update of \(\hat Q\) and coincides with \(Q^\ast(s, a)\) once \(\hat Q = Q^\ast\). This is the idea of the fitted Q-iteration (FQI). To be precise, the algorithm is as follows:

  • Parameterize the Q-function as \(f_\theta(s, a)\) and initialize \(f_{\theta_0}(s, a)\)
  • For \(k = 1, 2, \ldots\)
    • Update the estimate of the parameter, \(\theta_k\), by minimizing the following \(\ell_2\) loss \[ \theta_k = \argmin_{\theta} \frac{1}{n} \sum_{i=1}^n \left( f_{\theta}(s_i, a_i) - r_i - \gamma \max_{a'} f_{\theta_{k-1}}(s_i', a') \right)^2 \]
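
To see why this regression step makes sense, note that, over a rich enough function class, the population minimizer of the squared loss above is the conditional expectation of the target given \((s, a)\):

\[ \E\Big[ r + \gamma \max_{a'} f_{\theta_{k-1}}(s', a') \,\Big|\, s, a \Big] = r(s, a) + \gamma \E_{s'} \Big[ \max_{a'} f_{\theta_{k-1}}(s', a') \Big], \]

which is exactly the right-hand side of the Bellman optimality equation with \(Q^\ast\) replaced by the current estimate. In particular, \(Q^\ast\) itself is a fixed point of this iteration.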

Example: Linear Q-Function

To illustrate the idea of FQI, let’s consider a simulation example that mimics the cartpole problem. The cartpole problem is a classic control problem in reinforcement learning, where the goal is to balance a pole on a cart. Our design has two state variables: the first mimics the location of the stick and the second mimics its velocity. The action space has two actions: action 1 pushes the stick to the right, and action 2 pushes it to the left. The reward is the negative squared distance between the location at the next time point and 0.5.

  set.seed(546)  # for reproducibility

  # Number of samples
  n <- 2001
  
  # Initialize matrices for states and next states
  states <- matrix(NA, n, 2)
  next_states <- matrix(NA, n, 2)
  actions <- integer(n)
  rewards <- numeric(n)
  
  # Define the transition dynamics as a function, with some noise
  transition <- function(s, a) {
    s1_next = s[1] + 0.1*s[2] + 0.1 * (a == 1) - 0.1 * (a == 2) + rnorm(1, sd = 0.05)
    s2_next = 0.9*s[2] - 0.4*s[1] + 0.05 * (a == 1) - 0.05 * (a == 2) + rnorm(1, sd = 0.05)
    return(c(s1_next, s2_next))
  }

  # Define the reward function, using state at next time point
  reward <- function(s) {
    r <- -(s[1] - 0.5)^2
    return(r)
  }

  # Initial state (can be randomized or set to a specific value)
  states[1,] <- runif(2, min = -1, max = 1)
  
  # Simulate the MDP
  for (i in 1:(n-1)) {
    # random action
    actions[i] <- sample(1:2, 1)
  
    # get the next state
    next_states[i,] <- transition(states[i,], actions[i])
    
    # Calculate the reward
    rewards[i] <- reward(next_states[i,])
    
    # update the state
    states[i+1,] <- next_states[i,]
  }
  
  data = data.frame("s1" = states[, 1],
                    "s2" = states[, 2],
                    "a" = actions,
                    "r" = rewards,
                    "s1_next" = next_states[, 1],
                    "s2_next" = next_states[, 2])[-n, ]

  head(data)
  par(mfrow=c(1, 2))

  m = 500
  # plot the states
  plot(data$s1[1:m], type = "l", col = "blue", lwd = 2, 
       xlab = "Time", ylab = "State", ylim = c(-2, 2))
  lines(data$s2[1:m], col = "red", lwd = 2)
  legend("topright", c("State 1", "State 2"), col = c("blue", "red"), lwd = 2)
  
  # plot the rewards
  plot(data$r[1:m], col = "purple", pch = 19, 
       xlab = "Time", ylab = "Reward", ylim = c(-3, 0.1))

Now we use the FQI algorithm to estimate the Q-function. We will use a linear model with interactions to parameterize the Q-function.

  # initialize the Q-function
  Q_fit <- lm(r ~ s1 * s2 * a, data = data)

  # number of iterations
  n_iter <- 100
  
  # discount factor
  gamma <- 0.8
  
  # data for the next state
  data_next1 = data.frame("s1" = data$s1_next, "s2" = data$s2_next, "a" = rep(1, nrow(data)))
  data_next2 = data.frame("s1" = data$s1_next, "s2" = data$s2_next, "a" = rep(2, nrow(data)))
  
  for (k in 1:n_iter) {
    # calculate the Q-value for each action
    Q1 = predict(Q_fit, newdata = data_next1)
    Q2 = predict(Q_fit, newdata = data_next2)
    
    # get the target value
    target = data$r + gamma * pmax(Q1, Q2)
    
    # update the Q-function
    Q_fit = lm(target ~ s1 * s2 * a, data = data)
  }
  
  summary(Q_fit)
## 
## Call:
## lm(formula = target ~ s1 * s2 * a, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.3340 -0.1203  0.0234  0.1539  0.6532 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.55572    0.01735  32.033  < 2e-16 ***
## s1           2.12397    0.05147  41.267  < 2e-16 ***
## s2           1.09670    0.02842  38.593  < 2e-16 ***
## a           -0.66561    0.01093 -60.885  < 2e-16 ***
## s1:s2        0.53347    0.07578   7.040 2.64e-12 ***
## s1:a         0.32104    0.03245   9.894  < 2e-16 ***
## s2:a        -0.02869    0.01786  -1.606    0.108    
## s1:s2:a     -0.01400    0.04854  -0.288    0.773    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2319 on 1992 degrees of freedom
## Multiple R-squared:  0.945,  Adjusted R-squared:  0.9448 
## F-statistic:  4891 on 7 and 1992 DF,  p-value: < 2.2e-16

We can check whether this new greedy policy performs better than the random behavior policy.

  # simulate the new policy
  states_new = matrix(NA, n, 2)
  next_states_new <- matrix(NA, n, 2)
  actions_new = integer(n)
  rewards_new = numeric(n)
  
  # Initial state
  states_new[1,] <- c(-1, 1)
  
  # a temporary data frame to store the next state
  data_new1 = data.frame("s1" = states_new[1,1], "s2" = states_new[1,2], "a" = 1)
  data_new2 = data.frame("s1" = states_new[1,1], "s2" = states_new[1,2], "a" = 2)
  
  for (i in 1:(n-1)) {
    
    # calculate the best action
    Q1 = predict(Q_fit, newdata = data_new1)
    Q2 = predict(Q_fit, newdata = data_new2)
    
    # select the best action
    actions_new[i] = ifelse(Q1 > Q2, 1, 2)
    
    # generate the next state
    next_states_new[i,] <- transition(states_new[i,], actions_new[i])
    
    # Calculate the reward
    rewards_new[i] <- reward(next_states_new[i,])
    
    # update the state
    states_new[i+1,] <- next_states_new[i,]
    
    # update the temporary data frame
    data_new1 = data.frame("s1" = next_states_new[i,1], "s2" = next_states_new[i,2], "a" = 1)
    data_new2 = data.frame("s1" = next_states_new[i,1], "s2" = next_states_new[i,2], "a" = 2)
  }
  
  data_FQI = data.frame("s1" = states_new[, 1],
                        "s2" = states_new[, 2],
                        "a" = actions_new,
                        "r" = rewards_new,
                        "s1_next" = next_states_new[, 1],
                        "s2_next" = next_states_new[, 2])[-n, ]
  par(mfrow=c(1, 2))

  m = 300
  # plot the states
  plot(data_FQI$s1[1:m], type = "l", col = "blue", lwd = 2, 
       xlab = "Time", ylab = "State", ylim = c(-2, 2))
  lines(data_FQI$s2[1:m], col = "red", lwd = 2)
  legend("topright", c("State 1", "State 2"), col = c("blue", "red"), lwd = 2)
  
  # plot the rewards
  plot(data_FQI$r[1:m], col = "purple", pch = 19, 
       xlab = "Time", ylab = "Reward", ylim = c(-3, 0.1))
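
As a quick numerical check (using the two simulated data frames above), we can compare the average reward collected under the random behavior policy with that under the greedy FQI policy; the greedy policy should keep the location much closer to 0.5.

  # compare the average rewards of the two simulated trajectories
  mean(data$r)      # random actions
  mean(data_FQI$r)  # greedy policy from the fitted Q-function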

In fact, the original fitted Q-iteration proposed by Ernst, Geurts, and Wehenkel (2005) uses tree-based models (CART, bagging, random forests, etc.) to estimate the Q-function, and in principle we can use whatever model best fits the data.
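
For instance, below is a minimal sketch of a tree-based variant, assuming the randomForest package is available; the number of trees and iterations are arbitrary illustrative choices, not the exact settings of Ernst, Geurts, and Wehenkel (2005).

  # a sketch of tree-based FQI (assumes the randomForest package is installed)
  library(randomForest)

  X <- data[, c("s1", "s2", "a")]

  # initialize with the immediate reward as the regression target
  Q_rf <- randomForest(x = X, y = data$r, ntree = 100)

  for (k in 1:50) {
    # evaluate the current Q estimate at the next state for both actions
    Q1 <- predict(Q_rf, newdata = data_next1)
    Q2 <- predict(Q_rf, newdata = data_next2)

    # form the Bellman target and refit the forest
    target <- data$r + gamma * pmax(Q1, Q2)
    Q_rf <- randomForest(x = X, y = target, ntree = 100)
  }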

Linear Bellman Completeness and Linear MDP

For tabular cases, with a finite number of states and actions, the Q-function can take at most \(| \cS | \times | \cA |\) distinct values. As long as we model the Q-function for each such state-action pair, we are guaranteed to be able to capture the true Q-function. However, for continuous states and actions, the Q-function can take infinitely many values, and in practice we can only approximate it with a finite number of parameters. The question is then how to ensure that the approximation is good enough. The linear Bellman completeness condition (Munos 2005) was proposed to address this issue. It essentially requires that the Bellman update of any linear function of the features remains a linear function of the same features.

Definition 1 (Linear Bellman Completeness) Denote \(\phi : \cS \times \cA \rightarrow \mathbb{R}^d\) as a set of feature maps, so that a linear function of the form \(f(s, a) = \theta^\T \phi(s, a)\) can be used to approximate the Q-function. This feature map satisfies the linear Bellman completeness property if for any \(\theta \in \mathbb{R}^d\), there exists \(w \in \mathbb{R}^d\) such that for all \((s, a) \in \cS \times \cA\), \[ w^\T \phi(s, a) = r(s, a) + \gamma \E_{s'} \big[ \max_{a'} \theta^\T \phi(s', a') \big] \]

Hence, for the tabular case, the linear Bellman completeness condition is trivially satisfied if we define the feature map as the indicator functions of the state-action pairs. Note that the condition also implies that the reward function is linear in the features. Our previous example does not satisfy this condition, yet the results show that we still obtained a good approximation of the Q-function. Overall, linear Bellman completeness is a rather strong assumption used to guarantee the convergence of many reinforcement learning algorithms, but linear approximation can still work well even when it is violated.
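
To see the tabular claim concretely, take \(\phi(s, a) = \be_{(s, a)}\), the indicator (one-hot) vector of the state-action pair. For any \(\theta\), setting

\[ w_{(s, a)} = r(s, a) + \gamma \sum_{s'} P(s' \mid s, a) \max_{a'} \theta_{(s', a')} \]

satisfies the equation in Definition 1, since \(w^\T \phi(s, a) = w_{(s, a)}\) and \(\theta^\T \phi(s', a') = \theta_{(s', a')}\).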

There are also other ways to specify linear completeness. For example, when we only care about evaluating a fixed policy \(\pi\), we can use the following definition:

Definition 2 (Linear Bellman Completeness for Policy Evaluation) Under the same setting as in Definition 1, we require that for a given policy \(\pi\) and any \(\theta \in \mathbb{R}^d\), there exists \(w \in \mathbb{R}^d\) such that for all \((s, a) \in \cS \times \cA\), \[ w^\T \phi(s, a) = r(s, a) + \gamma \E_{s'} \big[ \E_{a' \sim \pi} \big[ \theta^\T \phi(s', a') \big] \big] \]

However, these are rather technical conditions. Jin et al. (2023) provide a sufficient condition, called the linear MDP, for a finite-horizon problem. A linear MDP is a special case of an MDP in which the reward function and the transition probability are linear in a feature map of the state and action.

Definition 3 (Linear MDP) There exist a feature map \(\phi : \cS \times \cA \rightarrow \mathbb{R}^d\), a vector \(\eta \in \mathbb{R}^d\), and a vector of \(d\) unknown measures \(\bmu = ( \mu_1, \ldots, \mu_d)^\T\) on \(\cS\) such that the reward function and transition probability can be written as

\[ r(s, a) = \eta^\T \phi(s, a) \quad \text{and} \quad P(s' | s, a) = \bmu(s')^\T \phi(s, a), \quad \text{respectively.} \]
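
As a sanity check, the tabular MDP is a linear MDP with \(d = |\cS| \times |\cA|\): take \(\phi(s, a) = \be_{(s, a)}\), let the \((s, a)\)-th entry of \(\eta\) be \(r(s, a)\), and let \(\mu_{(s, a)}(s') = P(s' \mid s, a)\), so that \(\eta^\T \phi(s, a) = r(s, a)\) and \(\bmu(s')^\T \phi(s, a) = P(s' \mid s, a)\).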

Under the linear MDP assumption, the Bellman update based on any value function \(V(s)\) (e.g., \(V(s') = \E_{a' \sim \pi}\big[ Q(s', a') \big]\) for policy evaluation, or \(V(s') = \max_{a'} Q(s', a')\) for the optimality version) can be written as

\[ \begin{aligned} T^\pi Q(s, a) &= r(s, a) + \gamma \E_{s'} \big[ V(s') \big] \\ &= \eta^\T \phi(s, a) + \gamma \int_{s'} V(s') \bmu(s')^\T \phi(s, a) \, ds' \\ &= \bigg\langle \phi(s, a), \eta + \gamma \int_{s'} V(s') \bmu(s') \, ds' \bigg\rangle \end{aligned} \]

Hence, we can define weights

\[ w = \eta + \gamma \int_{s'} V(s') \bmu(s') \, ds' \]

which satisfies the linear Bellman completeness condition. Some care is needed to control the norm of \(w\) to ensure the convergence of the algorithm. Jin et al. (2023) analyzed the finite-horizon case, while Wei et al. (2021) studied the average-reward setting; the infinite-horizon discounted setting is still an open problem.

Choosing the Feature Maps

Choosing the feature map is essentially the same as choosing a functional space in which the Q-function lies, and this choice can be crucial for the performance of the algorithm. In the previous example, we used a very simple choice: a linear model with interactions. In practice, the reproducing kernel Hilbert space (RKHS) is a popular choice. An RKHS is a space of functions whose inner product is defined by a kernel function, which is a symmetric positive definite function; it is a very flexible space that can approximate continuous functions well. Some properties of the RKHS can be found in this lecture note.
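
As an illustration, below is a minimal sketch of the same FQI loop (reusing data and gamma from the example above) with a finite set of Gaussian radial basis features of the state, one block of features per action; this is only a crude finite-dimensional stand-in for a Gaussian-kernel RKHS, and the centers, bandwidth, and small ridge penalty are arbitrary choices for illustration rather than recommended tuning.

  # a sketch of FQI with Gaussian RBF features of the state
  set.seed(1)
  centers <- data[sample(nrow(data), 20), c("s1", "s2")]  # arbitrary centers
  bw <- 0.5                                               # arbitrary bandwidth

  # feature map: RBF features of the state, one block per action
  rbf_features <- function(s1, s2, a) {
    K <- exp(-(outer(s1, centers$s1, "-")^2 + outer(s2, centers$s2, "-")^2) / (2 * bw^2))
    cbind(K * (a == 1), K * (a == 2))
  }

  Phi  <- rbf_features(data$s1, data$s2, data$a)
  Phi1 <- rbf_features(data$s1_next, data$s2_next, 1)
  Phi2 <- rbf_features(data$s1_next, data$s2_next, 2)

  # FQI with a slightly ridged least-squares update at each iteration
  theta <- rep(0, ncol(Phi))
  for (k in 1:100) {
    target <- data$r + gamma * pmax(drop(Phi1 %*% theta), drop(Phi2 %*% theta))
    theta <- solve(crossprod(Phi) + 1e-6 * diag(ncol(Phi)), crossprod(Phi, target))
  }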

Reference

Ernst, Damien, Pierre Geurts, and Louis Wehenkel. 2005. “Tree-Based Batch Mode Reinforcement Learning.” Journal of Machine Learning Research 6.
Jin, Chi, Zhuoran Yang, Zhaoran Wang, and Michael I Jordan. 2023. “Provably Efficient Reinforcement Learning with Linear Function Approximation.” Mathematics of Operations Research 48 (3): 1496–1521.
Munos, Rémi. 2005. “Error Bounds for Approximate Value Iteration.” In Proceedings of the National Conference on Artificial Intelligence, 20:1006. AAAI Press.
Wei, Chen-Yu, Mehdi Jafarnia Jahromi, Haipeng Luo, and Rahul Jain. 2021. “Learning Infinite-Horizon Average-Reward Mdps with Linear Function Approximation.” In International Conference on Artificial Intelligence and Statistics, 3007–15. PMLR.

  1. This data can either come from a single trajectory or be combined from multiple trajectories.↩︎