Up to now, there are several components that we have not really dealt with:
A very popular algorithm that addresses all these challenges is the Fitted Q-Iteration algorithm, proposed by Ernst, Geurts, and Wehenkel (2005). The key idea is to approximate the Q-function using a regression model (e.g., random forests, kernel methods, or neural networks), based on the Bellman optimality equation, and iteratively update this estimate. The algorithm proceeds as follows:

1. Collect a batch of transitions \((s_i, a_i, r_i, s'_i)\), \(i = 1, \ldots, n\), from the observed trajectories.
2. Initialize \(\widehat{Q}_0\), for example by regressing the observed reward \(r_i\) on \((s_i, a_i)\).
3. At iteration \(k\), construct the targets \(y_i = r_i + \gamma \max_{a'} \widehat{Q}_{k-1}(s'_i, a')\).
4. Fit a new regression model \(\widehat{Q}_k\) of \(y_i\) on \((s_i, a_i)\).
5. Repeat steps 3-4 until the Q-function (or the induced greedy policy) stabilizes.
In this algorithm, the target value corresponds to the random quantity (due to the randomness of the transition and the reward) inside the expectation of the Bellman optimality operator:
\[ \begin{aligned} \cT_Q^\ast Q(s,a) &= r(s,a) + \gamma \E_{S' \sim p(\cdot \mid s,a)} \left[ \max_{a'} Q(S', a') \right] \\ &= \E_{R, S' \sim p(\cdot \mid s,a)} \left[ R + \gamma \max_{a'} Q(S', a') \right]\\ &= \E \left[ Y \mid S=s, A=a \right], \end{aligned} \]
Since we only observe samples \((r_i, s'_i)\) rather than the expectation itself, the empirical target \(y_i = r_i + \gamma \max_{a'} \widehat{Q}(s'_i, a')\) provides a sample-based approximation to the Bellman optimality backup.
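Before moving to a concrete example, here is a minimal schematic of the batch loop, assuming a generic data frame `batch` that contains the current state, action, reward, and next state, and two placeholder routines `fit_q()` and `predict_q()` standing in for whichever regression model is used. These names are illustrative and do not correspond to any particular package.

# Schematic fitted Q-iteration loop (placeholder regression routines, illustrative only)
fitted_q_iteration <- function(batch, gamma, actions, n_iter, fit_q, predict_q) {
  # fit_q(targets, batch):      fit a regression of the targets on (state, action)
  # predict_q(model, batch, a): predict Q(next state, a) for every row of batch
  q_hat <- fit_q(batch$reward, batch)   # initialize with the observed rewards
  for (k in seq_len(n_iter)) {
    # Q-values of the next states under every candidate action
    q_next <- sapply(actions, function(a) predict_q(q_hat, batch, a))
    # Bellman targets: reward plus discounted maximum over next-state Q-values
    y <- batch$reward + gamma * apply(q_next, 1, max)
    # Refit the regression with the updated targets
    q_hat <- fit_q(y, batch)
  }
  q_hat
}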
The CartPole environment is a classic benchmark problem in reinforcement learning. It involves balancing a pole on a moving cart by applying forces to the cart. The state space is continuous, and the action space can be either discrete (e.g., left or right force) or continuous (e.g., varying force magnitude). We have created an environment simulator in R. The details of the environment dynamics and parameters can be found in our cartpole document.
# CartPole environment with actions in [-1, 1]
cartpole_step <- function(state, action,
                          reward.method = "discrete") {
  # Unpack state
  x <- state[1]
  x_dot <- state[2]
  theta <- state[3]
  theta_dot <- state[4]

  # Parameters
  g <- 9.8
  m_c <- 1.0
  m_p <- 0.1
  l <- 0.5
  F_max <- 5
  tau <- 0.01
  x_th <- 0.2
  theta_th <- 12 * (pi / 180)
  total_mass <- m_c + m_p

  # Action in [-1, 1]; discrete {-1, +1} is a special case
  f <- action * F_max

  # Compute accelerations
  costheta <- cos(theta)
  sintheta <- sin(theta)
  temp <- (f + m_p * l * theta_dot^2 * sintheta) / total_mass
  theta_acc <- (g * sintheta - costheta * temp) /
    (l * (4/3 - (m_p * costheta^2) / total_mass))
  x_acc <- temp - (m_p * l * theta_acc * costheta) / total_mass

  # Update state using semi-implicit Euler
  x_dot_new <- x_dot + tau * x_acc
  x_new <- x + tau * x_dot_new
  theta_dot_new <- theta_dot + tau * theta_acc
  theta_new <- theta + tau * theta_dot_new
  new_state <- c(x_new, x_dot_new, theta_new, theta_dot_new)

  # Compute termination
  done <- (abs(x_new) > x_th) ||
    (abs(theta_new) > theta_th)

  # Compute reward
  if (reward.method == "discrete") {
    reward <- ifelse(abs(theta_new) > 0.1 | abs(x_new) > 0.1, 0, 1)
  } else if (reward.method == "continuous") {
    reward <- - (0.01 * x_new^2 +         # position loss
                 100 * theta_new^2 +      # angle loss
                 0.01 * x_dot_new^2 +     # position stability
                 0.1 * theta_dot_new^2)   # angle stability
  } else {
    stop("Unknown reward.method")
  }

  return(list(state = new_state, reward = reward, done = done))
}
# Function to initialize the cartpole state
cartpole_initial <- function() {
  state <- runif(4, min = -0.05, max = 0.05)
  return(state)
}
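As a quick sanity check (this snippet is added here for illustration and is not part of the original pipeline), we can take a single step of the simulator from a random initial state and inspect the returned list:

# Quick check: one environment step from a random initial state, pushing to the right
set.seed(1)
s0 <- cartpole_initial()
one_step <- cartpole_step(s0, action = 1, reward.method = "discrete")
one_step$state   # updated (x, x_dot, theta, theta_dot)
one_step$reward  # 1 while |x| and |theta| stay within the 0.1 reward thresholds
one_step$done    # TRUE once |x| > 0.2 or |theta| exceeds 12 degrees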
# Function to generate episodes
generate_cartpole <- function(n_episode = 1,
                              T_max = 200,
                              reward.method = "discrete",
                              policy = function(state) {
                                sample(c(-1, 1), 1)
                              }) {
  all_rows <- list() # store rows before binding
  row_id <- 1
  for (ep in 1:n_episode) {
    state <- cartpole_initial()
    for (t in 1:T_max) {
      # Generate action
      A_t <- policy(state)
      # Check action range
      if (abs(A_t) > 1) {
        stop("Action out of range [-1, 1]")
      }
      # Step the environment
      out <- cartpole_step(state, A_t, reward.method = reward.method)
      S_next <- out$state
      R_t <- out$reward
      done <- out$done
      # Record (S_t, A_t, R_t)
      all_rows[[row_id]] <- data.frame(
        episode = ep,
        time = t,
        x = state[1],
        x_dot = state[2],
        theta = state[3],
        theta_dot = state[4],
        action = A_t,
        reward = R_t
      )
      row_id <- row_id + 1
      # Move to next state
      state <- S_next
      if (done) break
    }
  }
  # Combine into a data frame
  batch_data <- do.call(rbind, all_rows)
  rownames(batch_data) <- NULL
  return(batch_data)
}
set.seed(546)
cartpole_data = generate_cartpole(n_episode = 200, reward.method = "continuous")
# plot the trajectories of the cart position and rewards
library(ggplot2)
library(patchwork)
plot_cartpole <- function(data) {
  p1 <- ggplot(data, aes(x = time, y = x, color = factor(episode))) +
    geom_line() +
    labs(title = "Cart Position (x)", x = "Time", y = "x") +
    theme_minimal() +
    scale_y_continuous(limits = c(-0.2, 0.2)) +
    theme(legend.position = "none")
  p2 <- ggplot(data, aes(x = time, y = x_dot, color = factor(episode))) +
    geom_line() +
    labs(title = "Cart Velocity (x_dot)", x = "Time", y = "x_dot") +
    theme_minimal() +
    theme(legend.position = "none")
  p3 <- ggplot(data, aes(x = time, y = theta, color = factor(episode))) +
    geom_line() +
    labs(title = "Pole Angle (theta)", x = "Time", y = "theta") +
    theme_minimal() +
    scale_y_continuous(limits = c(-12 * (pi / 180), 12 * (pi / 180))) +
    theme(legend.position = "none")
  p4 <- ggplot(data, aes(x = time, y = theta_dot, color = factor(episode))) +
    geom_line() +
    labs(title = "Pole Angular Velocity (theta_dot)", x = "Time", y = "theta_dot") +
    theme_minimal() +
    theme(legend.position = "none")
  p5 <- ggplot(data, aes(x = time, y = reward, color = factor(episode))) +
    geom_line() +
    labs(title = "Reward", x = "Time", y = "reward") +
    theme_minimal() +
    theme(legend.position = "none")
  # 2x2 layout for the state variables, with the reward underneath
  (p1 | p2) /
    (p3 | p4) /
    p5
}
plot_cartpole(cartpole_data)
The original fitted Q-iteration algorithm was proposed with machine learning models such as random forests. Since the CartPole example is rather simple, we first implement it with a linear function approximation of the Q-function.
# Initial Q-function estimate: fit reward ~ state + action
linear_formula = y_targets ~ x + x_dot + theta + theta_dot + action +
  action*x + action*x_dot + action*theta + action*theta_dot
y_targets = cartpole_data$reward
fit_qliner <- lm(linear_formula, data = cartpole_data)

# Settings
n_iter <- 100
gamma <- 0.9

# Start to iterate
for (iter in 1:n_iter) {
  # Prepare data for next states with both actions
  Data_Next = cartpole_data[cartpole_data$time > 1, ]
  Last_T = cartpole_data$time == ave(cartpole_data$time, cartpole_data$episode, FUN = max)
  Data_Current = cartpole_data[!Last_T, ]
  Data_Next_A1 = Data_Next
  Data_Next_A1$action = -1
  Data_Next_A2 = Data_Next
  Data_Next_A2$action = 1
  # Predict Q-values for next states with both actions
  Q_next_A1 <- predict(fit_qliner, newdata = Data_Next_A1)
  Q_next_A2 <- predict(fit_qliner, newdata = Data_Next_A2)
  # Compute max Q-value for next states
  Q_next_max <- pmax(Q_next_A1, Q_next_A2)
  # Compute bootstrapped targets
  y_targets <- Data_Current$reward + gamma * Q_next_max
  # Fit new regression model
  fit_qliner <- lm(linear_formula, data = Data_Current)
}
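Although it is not required by the algorithm, a quick look at the fitted coefficients (using base R's `coef()`) shows which state variables and interactions dominate the linear Q-function:

# Inspect the fitted linear Q-function
round(coef(fit_qliner), 4)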
Based on this fitted linear Q-function, we can derive a greedy policy by selecting the action that maximizes the Q-value at each state. This essentially amounts to plugging the state into the fitted Q-function for each candidate action and choosing the better one.
# Derive greedy policy from fitted Q-function
greedy_policy <- function(state, Qfun) {
  state_df <- data.frame(
    x = state[1],
    x_dot = state[2],
    theta = state[3],
    theta_dot = state[4]
  )
  # Action = -1
  state_df$action <- -1
  Q_A1 <- predict(Qfun, newdata = state_df)
  # Action = +1
  state_df$action <- 1
  Q_A2 <- predict(Qfun, newdata = state_df)
  if (Q_A1 > Q_A2) {
    return(-1)
  } else {
    return(1)
  }
}
We can now use this function to generate new episodes under the learned policy and evaluate its performance.
# Evaluate the learned policy
new_policy <- function(state) {
  greedy_policy(state, Qfun = fit_qliner)
}
# Generate new episodes using the learned policy
test_data <- generate_cartpole(n_episode = 50,
                               reward.method = "continuous",
                               policy = new_policy)
# plot the trajectories of the cart position
plot_cartpole(test_data)
This is probably not much better than a random policy, since the linear function approximation is too simple for this problem: most of the trajectories still end with the pole falling down quickly. Recall our previous theoretical demonstration of Q-function error propagation in dynamic treatment regimes. The approximation error can accumulate over iterations, and when the function approximation is poor, fitted Q-iteration may not converge to a good policy.
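As a rough reminder of the flavor of such results (stated informally here for approximate value iteration with a uniform per-iteration regression error, not for our exact finite-sample setting): if every regression step incurs a sup-norm error of at most \(\epsilon\), then the greedy policies \(\pi_k\) satisfy

\[ \limsup_{k \to \infty} \, \big\| Q^{\pi_k} - Q^\ast \big\|_\infty \;\le\; \frac{2 \gamma \epsilon}{(1-\gamma)^2}, \]

so even a moderate but persistent approximation error gets inflated by the factor \(1/(1-\gamma)^2\).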
To learn a better Q-function, we next try a more flexible function class: a quadratic approximation that adds squared terms in the state variables.
# Initial Q-function estimate: fit reward ~ state + action
quad_formula = y_targets ~ I(x^2) + I(x_dot^2) + I(theta^2) + I(theta_dot^2) + action +
  action*x + action*x_dot + action*theta + action*theta_dot
y_targets = cartpole_data$reward
fit_qquad <- lm(quad_formula, data = cartpole_data)

# Settings
n_iter <- 100
gamma <- 0.9

# Start to iterate
for (iter in 1:n_iter) {
  # Prepare data for next states with both actions
  Data_Next = cartpole_data[cartpole_data$time > 1, ]
  Last_T = cartpole_data$time == ave(cartpole_data$time, cartpole_data$episode, FUN = max)
  Data_Current = cartpole_data[!Last_T, ]
  Data_Next_A1 = Data_Next
  Data_Next_A1$action = -1
  Data_Next_A2 = Data_Next
  Data_Next_A2$action = 1
  # Predict Q-values for next states with both actions
  Q_next_A1 <- predict(fit_qquad, newdata = Data_Next_A1)
  Q_next_A2 <- predict(fit_qquad, newdata = Data_Next_A2)
  # Compute max Q-value for next states
  Q_next_max <- pmax(Q_next_A1, Q_next_A2)
  # Compute bootstrapped targets
  y_targets <- Data_Current$reward + gamma * Q_next_max
  # Fit new regression model
  fit_qquad <- lm(quad_formula, data = Data_Current)
}
As before, we can implement the greedy policy based on the fitted quadratic Q-function.
# Evaluate the learned policy
learned_policy <- function(state) {
  greedy_policy(state, Qfun = fit_qquad)
}
# Generate new episodes using the learned policy
test_data <- generate_cartpole(n_episode = 50,
                               reward.method = "continuous",
                               policy = learned_policy)
# plot the trajectories of the cart position
plot_cartpole(test_data)
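To complement the plots with a simple quantitative summary (a check we add here under the same simulation settings), we can look at how many steps each test episode lasts before termination, with a maximum of 200:

# Steps survived per episode under the quadratic greedy policy (capped at 200)
episode_length <- tapply(test_data$time, test_data$episode, max)
summary(episode_length)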
We now use the ranger package to implement fitted Q-iteration on the generated batch data with random forests. In the first step, we fit a regression model to predict the reward based on the current state and action. Note that the same function approximation issue can arise here, so the random forest needs to be tuned properly. For this example, we use the discrete reward, which is slightly more difficult.
# Generate batch data with the discrete reward
set.seed(546)
cartpole_data = generate_cartpole(n_episode = 200, reward.method = "discrete")
# plot the trajectories of the cart position
plot_cartpole(cartpole_data)
# Initial Q-function estimate: fit reward ~ state + action
library(ranger)
nodesize = 100
ntrees = 500
rf_q <- ranger(
  formula = reward ~ x + x_dot + theta + theta_dot + action,
  data = cartpole_data,
  num.trees = ntrees,
  min.node.size = nodesize
)
To implement the fitted Q-iteration algorithm, we do the same as before, but now we use random forests as the function approximation.
# get all data without the first time step in each episode
Data_Next = cartpole_data[cartpole_data$time > 1, ]
# get all data without the last time step in each episode
Not_Last_T = cartpole_data$time != ave(cartpole_data$time, cartpole_data$episode, FUN = max)
Data_Current = cartpole_data[Not_Last_T, ]
# define data frame with next states and action = -1
Data_Next_A1 = Data_Next
Data_Next_A1$action = -1
# define data frame with next states and action = +1
Data_Next_A2 = Data_Next
Data_Next_A2$action = +1
current_action = Data_Current$action
# settings
gamma = 0.9
n_iter = 20
# start to iterate
for (iter in 1:n_iter) {
  # Predict Q-values for next states with both actions
  Q_next_A1 <- predict(rf_q, data = Data_Next_A1)$predictions
  Q_next_A2 <- predict(rf_q, data = Data_Next_A2)$predictions
  # Compute max Q-value for next states
  Q_next_max <- pmax(Q_next_A1, Q_next_A2)
  # Best action at the next stage under the current Q-function
  action_next <- ifelse(Q_next_A1 > Q_next_A2, -1, 1)
  # Report the percentage of greedy actions that changed, to monitor convergence
  cat("Iteration :", iter, "| changed actions:",
      round(sum(action_next != current_action) / length(current_action) * 100, 2), "%\n")
  current_action <- action_next
  # Compute bootstrapped targets
  y_targets <- Data_Current$reward + gamma * Q_next_max
  # Fit new regression model
  rf_q <- ranger(
    formula = y_targets ~ x + x_dot + theta + theta_dot + action,
    data = Data_Current,
    num.trees = ntrees,
    min.node.size = nodesize
  )
}
## Iteration : 1 | changed actions: 51.7 %
## Iteration : 2 | changed actions: 44.68 %
## Iteration : 3 | changed actions: 10.17 %
## Iteration : 4 | changed actions: 6.61 %
## Iteration : 5 | changed actions: 5.77 %
## Iteration : 6 | changed actions: 6.32 %
## Iteration : 7 | changed actions: 3.67 %
## Iteration : 8 | changed actions: 3.6 %
## Iteration : 9 | changed actions: 3.32 %
## Iteration : 10 | changed actions: 3.12 %
## Iteration : 11 | changed actions: 2.96 %
## Iteration : 12 | changed actions: 2.44 %
## Iteration : 13 | changed actions: 1.77 %
## Iteration : 14 | changed actions: 1.77 %
## Iteration : 15 | changed actions: 2.03 %
## Iteration : 16 | changed actions: 2.06 %
## Iteration : 17 | changed actions: 2.02 %
## Iteration : 18 | changed actions: 1.51 %
## Iteration : 19 | changed actions: 1.72 %
## Iteration : 20 | changed actions: 1.29 %
We can now implement this policy and see how well it performs.
# Derive greedy policy from the fitted random forest Q-function
greedy_policy_rf <- function(state, Qfun) {
  state_df <- data.frame(
    x = state[1],
    x_dot = state[2],
    theta = state[3],
    theta_dot = state[4]
  )
  # Action = -1
  state_df$action <- -1
  Q_A1 <- predict(Qfun, data = state_df)$predictions
  # Action = +1
  state_df$action <- 1
  Q_A2 <- predict(Qfun, data = state_df)$predictions
  if (Q_A1 > Q_A2) {
    return(-1)
  } else {
    return(1)
  }
}
# Evaluate the learned policy
new_policy <- function(state) {
  greedy_policy_rf(state, Qfun = rf_q)
}
# Generate new episodes using the learned policy
test_data <- generate_cartpole(n_episode = 50,
                               reward.method = "discrete",
                               policy = new_policy)
# plot the trajectories of the cart position
plot_cartpole(test_data)
We can see that most of the trajectories now last much longer, indicating that the learned policy is effective at balancing the pole. In fact, our iteration did not run until convergence, as indicated by the percentage of changed actions at each iteration; with more iterations, the performance could be improved further. We can also notice that the learned policy does not pay much attention to the cart position \(x\). This is mainly because, in the training samples, we rarely observe the reward dropping to 0 due to the cart position leaving the designed range; most of the failures are due to the pole angle \(\theta\) exceeding its threshold. Hence, the learned policy focuses more on controlling the pole angle than the cart position. This phenomenon is related to both extrapolation error and distributional shift in offline reinforcement learning.
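A quick diagnostic that supports this explanation (added here as a check, not part of the original analysis) is to compare how often each failure mode appears in the training batch, and to summarize the episode lengths under the learned policy:

# Fraction of recorded training states beyond each reward threshold (0.1)
mean(abs(cartpole_data$x) > 0.1)
mean(abs(cartpole_data$theta) > 0.1)
# Steps survived per episode under the learned random forest policy (capped at 200)
summary(tapply(test_data$time, test_data$episode, max))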
This is known as the model-free setting in reinforcement learning, where the agent learns optimal policies directly from experience without explicitly modeling the environment's dynamics. If we estimate the transition dynamics, it is known as the model-based setting.