In our previous examples, all states and actions were discrete. However, in many real-world applications, the state space and action space are continuous. In this lecture we discuss some general methods for handling continuous states and actions. We first introduce a method called fitted Q-iteration (Ernst, Geurts, and Wehenkel 2005) and use it as an example to illustrate the idea of functional embedding. This is an example of offline (batch) reinforcement learning, where we have a fixed set of data and want to estimate the optimal policy from it.
Suppose we observed a set of offline data \(\{ (s_i, a_i, r_i, s_i') \}_{i=1}^n\)1, where \(s_i, s_i' \in \cS\), \(a_i \in \cA\), and \(r_i \in \mathbb{R}\). The goal is to estimate the optimal Q-function, which we know should satisfy the Bellman optimality equation:
\[ Q^\ast(s, a) = r(s, a) + \gamma \E_{s'} \big[ \max_{a'} Q^\ast(s', a') \big] \]
Suppose we have some estimate of the Q-function, say \(\hat Q\). Then a one-sample realization of the right-hand side of the Bellman optimality equation, conditioning on the “covariates” \((s, a)\), is
\[ r + \gamma \max_{a'} \hat Q(s', a') \]
where \(s'\) is the next state after taking action \(a\) at state \(s\). We can then use these one-sample realizations as outcomes in a regression model to learn their conditional expectation given \((s, a)\), which is the Bellman update of \(\hat Q\); the fixed point of this update is exactly \(Q^\ast(s, a)\). This is the idea of fitted Q-iteration (FQI). To be precise, the algorithm proceeds as follows: initialize \(\hat Q_0\), for example by regressing \(r_i\) on \((s_i, a_i)\); at iteration \(k\), construct the targets \(y_i = r_i + \gamma \max_{a'} \hat Q_{k-1}(s_i', a')\) and fit a regression of \(y_i\) on \((s_i, a_i)\) to obtain \(\hat Q_k\); repeat for a fixed number of iterations or until the fit stabilizes.
To illustrate the idea of FQI, let’s consider a simulation example that mimics the cartpole problem, a classic control problem in reinforcement learning where the goal is to balance a pole on a cart. Our design has a two-dimensional state, where the first coordinate mimics the location of the stick and the second mimics its velocity. The action space has two actions: action 1 pushes the stick to the right, and action 2 pushes it to the left. The reward measures how close the stick is to 0.5, defined as the negative squared distance of the next state's first coordinate from 0.5.
set.seed(546) # for reproducibility
# Number of samples
n <- 2001
# Initialize matrices for states and next states
states <- matrix(NA, n, 2)
next_states <- matrix(NA, n, 2)
actions <- integer(n)
rewards <- numeric(n)
# Define the transition dynamics as a function, with some noise
transition <- function(s, a) {
  s1_next = s[1] + 0.1*s[2] + 0.1 * (a == 1) - 0.1 * (a == 2) + rnorm(1, sd = 0.05)
  s2_next = 0.9*s[2] - 0.4*s[1] + 0.05 * (a == 1) - 0.05 * (a == 2) + rnorm(1, sd = 0.05)
  return(c(s1_next, s2_next))
}
# Define the reward function, using the state at the next time point
reward <- function(s) {
  r <- -(s[1] - 0.5)^2
  return(r)
}
# Initial state (can be randomized or set to a specific value)
states[1,] <- runif(2, min = -1, max = 1)
# Simulate the MDP
for (i in 1:(n-1)) {
  # random action
  actions[i] <- sample(1:2, 1)
  # get the next state
  next_states[i,] <- transition(states[i,], actions[i])
  # Calculate the reward
  rewards[i] <- reward(next_states[i,])
  # update the state
  states[i+1,] <- next_states[i,]
}
data = data.frame("s1" = states[, 1],
                  "s2" = states[, 2],
                  "a" = actions,
                  "r" = rewards,
                  "s1_next" = next_states[, 1],
                  "s2_next" = next_states[, 2])[-n, ]
head(data)
par(mfrow=c(1, 2))
m = 500
# plot the states
plot(data$s1[1:m], type = "l", col = "blue", lwd = 2,
     xlab = "Time", ylab = "State", ylim = c(-2, 2))
lines(data$s2[1:m], col = "red", lwd = 2)
legend("topright", c("State 1", "State 2"), col = c("blue", "red"), lwd = 2)
# plot the rewards
plot(data$r[1:m], col = "purple", pch = 19,
     xlab = "Time", ylab = "Reward", ylim = c(-3, 0.1))
Now we use the FQI algorithm to estimate the Q-function. We will use a linear model with interactions to parameterize the Q-function.
# initialize the Q-function
Q_fit <- lm(r ~ s1 * s2 * a, data = data)
# number of iterations
n_iter <- 100
# discount factor
gamma <- 0.8
# data for the next state
data_next1 = data.frame("s1" = data$s1_next, "s2" = data$s2_next, "a" = rep(1, nrow(data)))
data_next2 = data.frame("s1" = data$s1_next, "s2" = data$s2_next, "a" = rep(2, nrow(data)))
for (k in 1:n_iter) {
  # calculate the Q-value for each action
  Q1 = predict(Q_fit, newdata = data_next1)
  Q2 = predict(Q_fit, newdata = data_next2)
  # get the target value
  target = data$r + gamma * pmax(Q1, Q2)
  # update the Q-function
  Q_fit = lm(target ~ s1 * s2 * a, data = data)
}
summary(Q_fit)
##
## Call:
## lm(formula = target ~ s1 * s2 * a, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.3340 -0.1203 0.0234 0.1539 0.6532
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.55572 0.01735 32.033 < 2e-16 ***
## s1 2.12397 0.05147 41.267 < 2e-16 ***
## s2 1.09670 0.02842 38.593 < 2e-16 ***
## a -0.66561 0.01093 -60.885 < 2e-16 ***
## s1:s2 0.53347 0.07578 7.040 2.64e-12 ***
## s1:a 0.32104 0.03245 9.894 < 2e-16 ***
## s2:a -0.02869 0.01786 -1.606 0.108
## s1:s2:a -0.01400 0.04854 -0.288 0.773
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2319 on 1992 degrees of freedom
## Multiple R-squared: 0.945, Adjusted R-squared: 0.9448
## F-statistic: 4891 on 7 and 1992 DF, p-value: < 2.2e-16
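The fitted model defines a greedy policy: at each state, take the action with the larger predicted Q-value. The following small helper makes this explicit (the function name greedy_policy is ours, purely for illustration; the simulation loop below does the same comparison inline).
# greedy policy implied by the fitted Q-function: pick the action with the larger
# predicted Q-value at the given state (helper name is ours)
greedy_policy <- function(s1, s2, Q_model) {
  q1 <- predict(Q_model, newdata = data.frame("s1" = s1, "s2" = s2, "a" = 1))
  q2 <- predict(Q_model, newdata = data.frame("s1" = s1, "s2" = s2, "a" = 2))
  ifelse(q1 > q2, 1, 2)
}
greedy_policy(-1, 1, Q_fit)   # e.g., the greedy action at the initial state used below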
We can check whether this new greedy policy performs better or not.
# simulate the new policy
states_new = matrix(NA, n, 2)
next_states_new <- matrix(NA, n, 2)
actions_new = integer(n)
rewards_new = numeric(n)
# Initial state
states_new[1,] <- c(-1, 1)
# a temporary data frame to store the next state
data_new1 = data.frame("s1" = states_new[1,1], "s2" = states_new[1,2], "a" = 1)
data_new2 = data.frame("s1" = states_new[1,1], "s2" = states_new[1,2], "a" = 2)
for (i in 1:(n-1)) {
  # calculate the Q-value for each action
  Q1 = predict(Q_fit, newdata = data_new1)
  Q2 = predict(Q_fit, newdata = data_new2)
  # select the best action
  actions_new[i] = ifelse(Q1 > Q2, 1, 2)
  # generate the next state
  next_states_new[i,] <- transition(states_new[i,], actions_new[i])
  # Calculate the reward
  rewards_new[i] <- reward(next_states_new[i,])
  # update the state
  states_new[i+1,] <- next_states_new[i,]
  # update the temporary data frame
  data_new1 = data.frame("s1" = next_states_new[i,1], "s2" = next_states_new[i,2], "a" = 1)
  data_new2 = data.frame("s1" = next_states_new[i,1], "s2" = next_states_new[i,2], "a" = 2)
}
data_FQI = data.frame("s1" = states_new[, 1],
                      "s2" = states_new[, 2],
                      "a" = actions_new,
                      "r" = rewards_new,
                      "s1_next" = next_states_new[, 1],
                      "s2_next" = next_states_new[, 2])[-n, ]
par(mfrow=c(1, 2))
m = 300
# plot the states
plot(data_FQI$s1[1:m], type = "l", col = "blue", lwd = 2,
     xlab = "Time", ylab = "State", ylim = c(-2, 2))
lines(data_FQI$s2[1:m], col = "red", lwd = 2)
legend("topright", c("State 1", "State 2"), col = c("blue", "red"), lwd = 2)
# plot the rewards
plot(data_FQI$r[1:m], col = "purple", pch = 19,
     xlab = "Time", ylab = "Reward", ylim = c(-3, 0.1))
In fact, the original fitted Q-iteration proposed by Ernst, Geurts, and Wehenkel (2005) uses tree-based models (CART, bagging, random forests, etc.) to estimate the Q-function, and in principle we can choose whatever model best fits the data, as sketched below.
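For instance, here is a minimal sketch of the same FQI loop with a random forest in place of lm(); it assumes the randomForest package is installed, and the number of trees and iterations are illustrative rather than tuned.
# a minimal sketch of FQI with a tree-based Q-function (requires the randomForest package);
# ntree and the number of iterations are illustrative, not tuned
library(randomForest)
dat_rf <- data
Q_rf <- randomForest(r ~ s1 + s2 + a, data = dat_rf, ntree = 100)
for (k in 1:20) {
  Q1 <- predict(Q_rf, newdata = data_next1)
  Q2 <- predict(Q_rf, newdata = data_next2)
  dat_rf$target <- data$r + gamma * pmax(Q1, Q2)
  Q_rf <- randomForest(target ~ s1 + s2 + a, data = dat_rf, ntree = 100)
}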
For tabular cases, with a finite number of states and actions, the Q-function can only take \(| \cS | \times | \cA |\) different values. As long as we model the Q-function separately for each such combination, we are guaranteed to be able to capture the true Q-function. However, with continuous states or actions, the Q-function can take infinitely many values, and in practice we can only approximate it with a finite number of parameters. The question is then how to ensure that the approximation is good enough. The linear Bellman completeness condition (Munos 2005) was proposed to address this issue: it essentially requires that when the Q-function is approximated by a linear combination of basis functions, the Bellman update of any such linear combination stays within the same linear span.
Definition 1 (Linear Bellman Completeness) Denote \(\phi : \cS \times \cA \rightarrow \mathbb{R}^d\) as a set of feature maps, so that a linear function of the form \(f(s, a) = \theta^\T \phi(s, a)\) can be used to approximate the Q-function. This feature mapping satisfies the linear Bellman completeness property if for any \(\theta \in \mathbb{R}^d\), there exists \(w \in \mathbb{R}^d\) such that, for all \((s, a) \in \cS \times \cA\), \[ w^\T \phi(s, a) = r(s, a) + \gamma \E_{s'} \big[ \max_{a'} \theta^\T \phi(s', a') \big] \]
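For example, in the tabular case we can take \(\phi(s, a) = e_{(s,a)}\), the indicator vector of the state-action pair \((s, a)\). Then for any \(\theta\), the right-hand side of the definition is reproduced coordinate-wise by choosing
\[ w_{(s, a)} = r(s, a) + \gamma \sum_{s'} P(s' | s, a) \max_{a'} \theta_{(s', a')}, \]
so the required \(w\) always exists.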
Hence, the tabular case trivially satisfies the linear Bellman completeness condition once we define the feature maps as indicator functions of each state-action pair. Note that the condition also implies that the reward function must be linear in the feature space. Our previous example does not satisfy this condition, yet the results show that we still obtained a good approximation of the Q-function. Overall, linear Bellman completeness is a fairly strong assumption used to guarantee the convergence of many reinforcement learning algorithms, but linear approximation can still work well even when the condition is violated.
There are also other ways to specify linear completeness. For example, when we only care about policy evaluation for a fixed policy \(\pi\), we can use the following definition:
Definition 2 (Linear Bellman Completeness for Policy Evaluation) Under the same setting as in Definition 1, we require that for a given policy \(\pi\) and any \(\theta \in \mathbb{R}^d\), there exists \(w \in \mathbb{R}^d\) such that, for all \((s, a)\), \[ w^\T \phi(s, a) = r(s, a) + \gamma \E_{s'} \big[ \E_{a' \sim \pi} \big[ \theta^\T \phi(s', a') \big] \big] \]
However, these are all rather technical conditions. Jin et al. (2023) provide a sufficient condition, called the linear MDP, for the finite-horizon problem. A linear MDP is a special case of an MDP in which the reward function and transition probability are both linear in a feature representation of the state and action.
Definition 3 (Linear MDP) There exist a feature mapping \(\phi : \cS \times \cA \rightarrow \mathbb{R}^d\) and a vector of \(d\) unknown measures \(\bmu = ( \mu_1, \ldots, \mu_d)^\T\) on \(\cS\) such that the reward function and transition probability can be written as
\[ r(s, a) = \eta^\T \phi(s, a) \quad \text{and} \quad P(s' | s, a) = \bmu(s')^\T \phi(s, a), \quad \text{respectively.} \]
Under the linear MDP assumption, the Bellman update based on any value function \(V(s)\) (e.g., \(V(s') = \E_{a' \sim \pi}[ Q(s', a') ]\) for policy evaluation) can be written as
\[ \begin{aligned} T^\pi Q(s, a) &= r(s, a) + \gamma \E_{s'} \big[ V(s') \big] \\ &= \eta^\T \phi(s, a) + \gamma \int_{s'} V(s') \bmu(s')^\T \phi(s, a) \, ds' \\ &= \bigg\langle \phi(s, a), \eta + \gamma \int_{s'} V(s') \bmu(s') \, ds' \bigg\rangle \end{aligned} \]
Hence, we can define weights
\[ w = \eta + \gamma \int_{s'} V(s') \bmu(s') \, ds' \]
which satisfies the linear Bellman completeness condition. Some care is needed in controlling the norm of \(w\) to ensure convergence of the algorithm. Jin et al. (2023) analyzed the finite-horizon problem, while Wei et al. (2021) studied the average-reward setting; the infinite-horizon discounted setting is still an open problem.
Choosing the feature maps is essentially the same as choosing a functional space in which the Q-function lies, and this choice can be crucial for the performance of the algorithm. In the previous example, we used a very simple choice: a linear model with interactions. In practice, a reproducing kernel Hilbert space (RKHS) is a popular choice. An RKHS is a space of functions whose inner product is defined by a symmetric positive definite kernel function; with a universal kernel such as the Gaussian kernel, it is flexible enough to approximate any continuous function. Some properties of the RKHS can be found in this lecture note.
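As a rough sketch of how an RKHS-type fit could replace lm() in the FQI loop above, one could use kernel ridge regression with a Gaussian kernel; the bandwidth h and ridge penalty lambda below are illustrative choices, not tuned.
# a minimal sketch: kernel ridge regression as the Q-function approximator,
# with a Gaussian kernel on (s1, s2, a); h and lambda are illustrative, not tuned
X <- as.matrix(data[, c("s1", "s2", "a")])
h <- 1
lambda <- 0.1
K <- exp(-as.matrix(dist(X))^2 / (2 * h^2))        # Gram matrix over observed (s, a) pairs
alpha <- solve(K + lambda * diag(nrow(K)), data$r) # initial fit of Q to the immediate reward
# predictions at new points (here: the next states paired with action 1) via the cross-kernel
Xnew <- as.matrix(cbind(data$s1_next, data$s2_next, 1))
Kx <- exp(-(outer(rowSums(Xnew^2), rowSums(X^2), "+") - 2 * Xnew %*% t(X)) / (2 * h^2))
Q1 <- Kx %*% alpha
# the FQI iterations would then proceed as before, refitting alpha to the updated targets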
This data can either come from a single trajectory or be combined from multiple trajectories.