This document summarizes the CartPole (inverted pendulum) environment and the simulation design used to generate offline data for reinforcement learning methods such as Fitted Q-Iteration. In Python, standard libraries such as gymnasium implement this environment. Here I reproduce the environment dynamics and physical model in R based on publicly available information. Since several of the tuning parameters differ, the dynamics will not behave exactly the same as the Gym implementation.
The environment is a continuous-state, discrete- or continuous-action Markov decision process with deterministic transitions. The goal is to keep the pole upright and the cart within the track for as long as possible. We consider a cart of mass \(m_c\) moving on a frictionless track, with a pole of mass \(m_p\) and total length \(2\ell\), hinged at the cart. The physical parameters used throughout are:
\[ \begin{aligned} g &= 9.8 \ \text{m/s}^2 \\ m_c &= 1.0 \ \text{kg} \\ m_p &= 0.1 \ \text{kg} \\ m_{\text{tot}} &= m_c + m_p = 1.1 \ \text{kg} \\ \ell &= 0.5 \ \text{m} \\ m_p \ell &= 0.05\ \text{kg·m} \end{aligned} \]
The parameters that control our data generator are:
\[ \begin{aligned} \text{max force:} \qquad F_{\max} &= 5 \ \text{N}\\ \text{max bound:} \qquad x_{\max} &= 0.2 \ \text{m} \\ \text{max angle:} \qquad \theta_{\max} &= 12^\circ \approx 0.2094\ \text{rad} \\ \text{step size:} \qquad \tau &= 0.01\ \text{s} \\ \end{aligned} \]
The physical constants match widely used implementations of the CartPole environment; the force limit, position bound, and step size above are specific to this data generator and differ from the Gym defaults.
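For reference, these constants can be collected into a single named list. This is only a convenience sketch; the step function defined later hard-codes the same values:

cartpole_params <- list(
  g         = 9.8,              # gravity (m/s^2)
  m_c       = 1.0,              # cart mass (kg)
  m_p       = 0.1,              # pole mass (kg)
  l         = 0.5,              # half pole length (m)
  F_max     = 5,                # maximum force magnitude (N)
  x_max     = 0.2,              # bound on |x| (m)
  theta_max = 12 * (pi / 180),  # bound on |theta| (rad)
  tau       = 0.01              # integration step (s)
)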
The state at time \(t\) is
\[ S_t = (x_t, \dot x_t, \theta_t, \dot\theta_t) \in \mathbb{R}^4, \]
where \(x_t\) is the cart position, \(\dot x_t\) the cart velocity, \(\theta_t\) the pole angle measured from the upright vertical, and \(\dot\theta_t\) the pole angular velocity.
Episodes start from small random perturbations around the upright equilibrium:
\[ (x_0, \dot x_0, \theta_0, \dot\theta_0) \sim \text{Uniform}([-0.05, 0.05]^4). \]
The environment accepts either discrete or continuous actions, both interpreted as values in the interval \([-1, 1]\).
Discrete actions: \[ A_t \in \{-1, +1\} \] representing a full-magnitude push to the left or right.
Continuous actions: \[ A_t \in [-1, +1] \]
In all cases, the horizontal force applied to the cart is
\[ F_t = A_t \, F_{\max}, \qquad |F_t| \le F_{\max}. \]
Thus discrete actions are a special case of the continuous control range.
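In code this mapping is a one-liner; a hypothetical helper (the step function below simply inlines action * F_max) could look like:

# Map an action in [-1, 1] to a horizontal force in Newtons
action_to_force <- function(action, F_max = 5) {
  stopifnot(abs(action) <= 1)  # reject actions outside [-1, 1]
  action * F_max
}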
Two reward structures are considered.
Survival-type reward:
\[ R_t = \begin{cases} 1, & \text{if } |x_{t+1}| \le 0.1 \text{ and } |\theta_{t+1}| \le 0.1, \\ 0, & \text{otherwise}. \end{cases} \]
This encourages policies that keep the system within safe bounds for as many steps as possible.
Quadratic shaping reward:
\[ R_t = - \left( 0.01 \, x_{t+1}^2 + 100 \, \theta_{t+1}^2 + 0.01 \, \dot x_{t+1}^2 + 0.1 \, \dot\theta_{t+1}^2 \right). \]
This penalizes deviation from the upright equilibrium in both position and velocity.
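Both reward rules depend only on the next state, so they can be written as a small helper. This sketch mirrors the reward branch of the step function defined below:

# Reward from the next state: "discrete" = survival, "continuous" = quadratic shaping
cartpole_reward <- function(next_state, reward.method = "discrete") {
  x <- next_state[1]; x_dot <- next_state[2]
  theta <- next_state[3]; theta_dot <- next_state[4]
  if (reward.method == "discrete") {
    as.numeric(abs(x) <= 0.1 && abs(theta) <= 0.1)
  } else {
    -(0.01 * x^2 + 100 * theta^2 + 0.01 * x_dot^2 + 0.1 * theta_dot^2)
  }
}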
An episode terminates at the first time step \(t\) for which
\[ |x_t| > x_{\max} \quad\text{or}\quad |\theta_t| > \theta_{\max}. \]
In practice, an additional maximum horizon \(T_{\max}\) (for example 200 or 500) may be imposed, in which case an episode also terminates if \(t \ge T_{\max}\).
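The termination rule is equally simple to express; a small illustrative check (the step and generator functions below implement the same logic) is:

# TRUE once the cart leaves the track bound, the pole falls past the angle bound,
# or the optional maximum horizon is reached
is_terminal <- function(state, t = 0, T_max = Inf,
                        x_max = 0.2, theta_max = 12 * (pi / 180)) {
  abs(state[1]) > x_max || abs(state[3]) > theta_max || t >= T_max
}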
Given the current state
\[ (x, \dot x, \theta, \dot\theta) \]
and the applied force \(F\), define
\[ \text{temp} = \frac{F + (m_p \ell)\,\dot\theta^{2}\,\sin(\theta)}{m_{\text{tot}}}. \]
The angular acceleration of the pole and the horizontal acceleration of the cart are
\[ \theta_{\text{acc}} = \frac{ g\,\sin(\theta) - \cos(\theta)\,\text{temp} }{ \ell\left( \frac{4}{3} - \frac{m_p \cos^{2}(\theta)}{m_{\text{tot}}} \right) }, \]
\[ x_{\text{acc}} = \text{temp} - \frac{m_p \ell\, \theta_{\text{acc}}\, \cos(\theta)}{m_{\text{tot}}}. \]
These nonlinear equations follow from Newtonian mechanics applied to the coupled cart-pole system. We discretize the dynamics using a fixed time step \(\tau = 0.01\) and semi-implicit (symplectic) Euler integration:
\[ \begin{aligned} \dot x_{t+1} &= \dot x_t + \tau\, x_{\text{acc},t}, \\ x_{t+1} &= x_t + \tau\, \dot x_{t+1}, \\ \dot\theta_{t+1} &= \dot\theta_t + \tau\, \theta_{\text{acc},t}, \\ \theta_{t+1} &= \theta_t + \tau\, \dot\theta_{t+1}. \end{aligned} \]
The mapping \((S_t, A_t) \mapsto S_{t+1}\) is deterministic.
The following R code implements the CartPole dynamics and rolls the episode forward by one step:
# CartPole environment with actions in [-1, 1]
cartpole_step <- function(state, action,
                          reward.method = "discrete") {
  # Unpack state
  x         <- state[1]
  x_dot     <- state[2]
  theta     <- state[3]
  theta_dot <- state[4]

  # Parameters
  g          <- 9.8
  m_c        <- 1.0
  m_p        <- 0.1
  l          <- 0.5
  F_max      <- 5
  tau        <- 0.01
  x_th       <- 0.2
  theta_th   <- 12 * (pi / 180)
  total_mass <- m_c + m_p

  # Action in [-1, 1]; discrete {-1, +1} is a special case
  f <- action * F_max

  # Compute accelerations
  costheta  <- cos(theta)
  sintheta  <- sin(theta)
  temp      <- (f + m_p * l * theta_dot^2 * sintheta) / total_mass
  theta_acc <- (g * sintheta - costheta * temp) /
    (l * (4/3 - (m_p * costheta^2) / total_mass))
  x_acc     <- temp - (m_p * l * theta_acc * costheta) / total_mass

  # Update state using semi-implicit Euler
  x_dot_new     <- x_dot + tau * x_acc
  x_new         <- x + tau * x_dot_new
  theta_dot_new <- theta_dot + tau * theta_acc
  theta_new     <- theta + tau * theta_dot_new
  new_state     <- c(x_new, x_dot_new, theta_new, theta_dot_new)

  # Compute termination
  done <- (abs(x_new) > x_th) ||
    (abs(theta_new) > theta_th)

  # Compute reward
  if (reward.method == "discrete") {
    reward <- ifelse(abs(theta_new) > 0.1 | abs(x_new) > 0.1, 0, 1)
  } else if (reward.method == "continuous") {
    reward <- -(0.01 * x_new^2 +         # position loss
                100 * theta_new^2 +      # angle loss
                0.01 * x_dot_new^2 +     # position stability
                0.1 * theta_dot_new^2)   # angle stability
  } else {
    stop("Unknown reward.method")
  }

  return(list(state = new_state, reward = reward, done = done))
}

# Function to initialize the cartpole state
cartpole_initial <- function() {
  state <- runif(4, min = -0.05, max = 0.05)
  return(state)
}
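As a quick sanity check, a full-magnitude push from the exact upright state \((0, 0, 0, 0)\) gives \(\text{temp} = 50/11\), \(\theta_{\text{acc}} = -300/41 \approx -7.32\ \text{rad/s}^2\), and \(x_{\text{acc}} = 200/41 \approx 4.88\ \text{m/s}^2\) under the equations above, so one call to cartpole_step should move the cart slightly to the right and rotate the pole slightly in the negative \(\theta\) direction:

# One semi-implicit Euler step from the upright state with a full push to the right
out <- cartpole_step(c(0, 0, 0, 0), action = 1, reward.method = "discrete")
out$state   # approx c(0.00049, 0.04878, -0.00073, -0.07317)
out$reward  # 1: the next state is still within the 0.1 bounds
out$done    # FALSE: |x| <= 0.2 and |theta| <= 12 degrees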
We can then generate a dataset of such transitions for use in offline reinforcement learning experiments. For this example, we will use a random policy that samples actions from the discrete set \(\{-1, +1\}\).
generate_cartpole <- function(n_episode = 1,
                              T_max = 200,
                              reward.method = "discrete",
                              policy = function(state) {
                                sample(c(-1, 1), 1)
                              }) {
  all_rows <- list()  # store rows before binding
  row_id <- 1

  for (ep in 1:n_episode) {
    state <- cartpole_initial()
    for (t in 1:T_max) {
      # Generate action
      A_t <- policy(state)
      # Check action range
      if (abs(A_t) > 1) {
        stop("Action out of range [-1, 1]")
      }
      # Step the environment
      out <- cartpole_step(state, A_t, reward.method = reward.method)
      S_next <- out$state
      R_t <- out$reward
      done <- out$done
      # Record (S_t, A_t, R_t)
      all_rows[[row_id]] <- data.frame(
        episode = ep,
        time = t,
        x = state[1],
        x_dot = state[2],
        theta = state[3],
        theta_dot = state[4],
        action = A_t,
        reward = R_t
      )
      row_id <- row_id + 1
      # Move to next state
      state <- S_next
      if (done) break
    }
  }

  # Combine into a dataframe
  batch_data <- do.call(rbind, all_rows)
  rownames(batch_data) <- NULL
  return(batch_data)
}
set.seed(546)
cartpole_data <- generate_cartpole()
head(cartpole_data)
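Because the policy is passed in as a function, the same generator can also produce continuous-action data. The call below is an illustrative example (arbitrary seed and object names) that samples actions uniformly from \([-1, 1]\) and scores them with the quadratic reward:

# Continuous actions drawn uniformly from [-1, 1], scored with the quadratic reward
set.seed(777)
continuous_policy <- function(state) runif(1, min = -1, max = 1)
cartpole_data_cont <- generate_cartpole(n_episode = 5,
                                        T_max = 200,
                                        reward.method = "continuous",
                                        policy = continuous_policy)
head(cartpole_data_cont)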
To visualize the generated data, we can plot the cart position and pole angle over time for a few episodes.
# Generate 10 episodes of discrete {-1, +1} data
set.seed(432)
cartpole_data <- generate_cartpole(n_episode = 10)
# plot the trajectories of the cart position and rewards
library(ggplot2)
library(patchwork)
plot_cartpole <- function(data) {
  p1 <- ggplot(data, aes(x = time, y = x, color = factor(episode))) +
    geom_line() +
    labs(title = "Cart Position (x)", x = "Time", y = "x") +
    theme_minimal() +
    scale_y_continuous(limits = c(-0.2, 0.2)) +
    theme(legend.position = "none")

  p2 <- ggplot(data, aes(x = time, y = x_dot, color = factor(episode))) +
    geom_line() +
    labs(title = "Cart Velocity (x_dot)", x = "Time", y = "x_dot") +
    theme_minimal() +
    theme(legend.position = "none")

  p3 <- ggplot(data, aes(x = time, y = theta, color = factor(episode))) +
    geom_line() +
    labs(title = "Pole Angle (theta)", x = "Time", y = "theta") +
    theme_minimal() +
    scale_y_continuous(limits = c(-12 * (pi / 180), 12 * (pi / 180))) +
    theme(legend.position = "none")

  p4 <- ggplot(data, aes(x = time, y = theta_dot, color = factor(episode))) +
    geom_line() +
    labs(title = "Pole Angular Velocity (theta_dot)", x = "Time", y = "theta_dot") +
    theme_minimal() +
    theme(legend.position = "none")

  p5 <- ggplot(data, aes(x = time, y = reward, color = factor(episode))) +
    geom_line() +
    labs(title = "Reward", x = "Time", y = "reward") +
    theme_minimal() +
    theme(legend.position = "none")

  # Layout: the four state panels in a 2x2 grid, reward below
  (p1 | p2) /
    (p3 | p4) /
    p5
}
plot_cartpole(cartpole_data)
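A quick tabulation of the last recorded time step per episode also shows how long the random policy keeps the system inside the bounds (a minimal base-R summary):

# Episode length = last time step recorded before termination (or T_max)
episode_length <- aggregate(time ~ episode, data = cartpole_data, FUN = max)
episode_length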