\(\newcommand{\ci}{\perp\!\!\!\perp}\) \(\newcommand{\cA}{\mathcal{A}}\) \(\newcommand{\cB}{\mathcal{B}}\) \(\newcommand{\cC}{\mathcal{C}}\) \(\newcommand{\cD}{\mathcal{D}}\) \(\newcommand{\cE}{\mathcal{E}}\) \(\newcommand{\cF}{\mathcal{F}}\) \(\newcommand{\cG}{\mathcal{G}}\) \(\newcommand{\cH}{\mathcal{H}}\) \(\newcommand{\cI}{\mathcal{I}}\) \(\newcommand{\cJ}{\mathcal{J}}\) \(\newcommand{\cK}{\mathcal{K}}\) \(\newcommand{\cL}{\mathcal{L}}\) \(\newcommand{\cM}{\mathcal{M}}\) \(\newcommand{\cN}{\mathcal{N}}\) \(\newcommand{\cO}{\mathcal{O}}\) \(\newcommand{\cP}{\mathcal{P}}\) \(\newcommand{\cQ}{\mathcal{Q}}\) \(\newcommand{\cR}{\mathcal{R}}\) \(\newcommand{\cS}{\mathcal{S}}\) \(\newcommand{\cT}{\mathcal{T}}\) \(\newcommand{\cU}{\mathcal{U}}\) \(\newcommand{\cV}{\mathcal{V}}\) \(\newcommand{\cW}{\mathcal{W}}\) \(\newcommand{\cX}{\mathcal{X}}\) \(\newcommand{\cY}{\mathcal{Y}}\) \(\newcommand{\cZ}{\mathcal{Z}}\) \(\newcommand{\bA}{\mathbf{A}}\) \(\newcommand{\bB}{\mathbf{B}}\) \(\newcommand{\bC}{\mathbf{C}}\) \(\newcommand{\bD}{\mathbf{D}}\) \(\newcommand{\bE}{\mathbf{E}}\) \(\newcommand{\bF}{\mathbf{F}}\) \(\newcommand{\bG}{\mathbf{G}}\) \(\newcommand{\bH}{\mathbf{H}}\) \(\newcommand{\bI}{\mathbf{I}}\) \(\newcommand{\bJ}{\mathbf{J}}\) \(\newcommand{\bK}{\mathbf{K}}\) \(\newcommand{\bL}{\mathbf{L}}\) \(\newcommand{\bM}{\mathbf{M}}\) \(\newcommand{\bN}{\mathbf{N}}\) \(\newcommand{\bO}{\mathbf{O}}\) \(\newcommand{\bP}{\mathbf{P}}\) \(\newcommand{\bQ}{\mathbf{Q}}\) \(\newcommand{\bR}{\mathbf{R}}\) \(\newcommand{\bS}{\mathbf{S}}\) \(\newcommand{\bT}{\mathbf{T}}\) \(\newcommand{\bU}{\mathbf{U}}\) \(\newcommand{\bV}{\mathbf{V}}\) \(\newcommand{\bW}{\mathbf{W}}\) \(\newcommand{\bX}{\mathbf{X}}\) \(\newcommand{\bY}{\mathbf{Y}}\) \(\newcommand{\bZ}{\mathbf{Z}}\) \(\newcommand{\ba}{\mathbf{a}}\) \(\newcommand{\bb}{\mathbf{b}}\) \(\newcommand{\bc}{\mathbf{c}}\) \(\newcommand{\bd}{\mathbf{d}}\) \(\newcommand{\be}{\mathbf{e}}\) \(\newcommand{\bg}{\mathbf{g}}\) \(\newcommand{\bh}{\mathbf{h}}\) \(\newcommand{\bi}{\mathbf{i}}\) \(\newcommand{\bj}{\mathbf{j}}\) \(\newcommand{\bk}{\mathbf{k}}\) \(\newcommand{\bl}{\mathbf{l}}\) \(\newcommand{\bm}{\mathbf{m}}\) \(\newcommand{\bn}{\mathbf{n}}\) \(\newcommand{\bo}{\mathbf{o}}\) \(\newcommand{\bp}{\mathbf{p}}\) \(\newcommand{\bq}{\mathbf{q}}\) \(\newcommand{\br}{\mathbf{r}}\) \(\newcommand{\bs}{\mathbf{s}}\) \(\newcommand{\bt}{\mathbf{t}}\) \(\newcommand{\bu}{\mathbf{u}}\) \(\newcommand{\bv}{\mathbf{v}}\) \(\newcommand{\bw}{\mathbf{w}}\) \(\newcommand{\bx}{\mathbf{x}}\) \(\newcommand{\by}{\mathbf{y}}\) \(\newcommand{\bz}{\mathbf{z}}\) \(\newcommand{\RR}{\mathbb{R}}\) \(\newcommand{\NN}{\mathbb{N}}\) \(\newcommand{\balpha}{\boldsymbol{\alpha}}\) \(\newcommand{\bbeta}{\boldsymbol{\beta}}\) \(\newcommand{\btheta}{\boldsymbol{\theta}}\) \(\newcommand{\hpi}{\widehat{\pi}}\) \(\newcommand{\bpi}{\boldsymbol{\pi}}\) \(\newcommand{\hbpi}{\widehat{\boldsymbol{\pi}}}\) \(\newcommand{\bxi}{\boldsymbol{\xi}}\) \(\newcommand{\bmu}{\boldsymbol{\mu}}\) \(\newcommand{\bepsilon}{\boldsymbol{\epsilon}}\) \(\newcommand{\bzero}{\mathbf{0}}\) \(\newcommand{\T}{\text{T}}\) \(\newcommand{\Trace}{\text{Trace}}\) \(\newcommand{\Cov}{\text{Cov}}\) \(\newcommand{\Var}{\text{Var}}\) \(\newcommand{\E}{\mathbb{E}}\) \(\newcommand{\Pr}{\text{Pr}}\) \(\newcommand{\pr}{\text{pr}}\) \(\newcommand{\pdf}{\text{pdf}}\) \(\newcommand{\P}{\text{P}}\) \(\newcommand{\p}{\text{p}}\) \(\newcommand{\One}{\mathbf{1}}\) \(\newcommand{\argmin}{\operatorname*{arg\,min}}\) 
\(\newcommand{\argmax}{\operatorname*{arg\,max}}\) \(\newcommand{\dtheta}{\frac{\partial}{\partial\theta} }\) \(\newcommand{\ptheta}{\nabla_\theta}\) \(\newcommand{\alert}[1]{\color{darkorange}{#1}}\) \(\newcommand{\alertr}[1]{\color{red}{#1}}\) \(\newcommand{\alertb}[1]{\color{blue}{#1}}\)

1 Overview

This document summarizes the CartPole (inverted pendulum) environment and the simulation design used to generate offline data for reinforcement learning methods such as Fitted Q-Iteration. In Python, standard libraries such as gymnasium implement this environment. Here I reproduce the environment dynamics and physical model in R based on publicly available information. Since several of the tuning parameters differ, the dynamics will not behave exactly the same as the Gym implementation.

2 Physical Model and Parameters

The environment is a continuous-state, discrete- or continuous-action Markov decision process with deterministic transitions. The goal is to keep the pole upright and the cart within the track for as long as possible. We consider a cart of mass \(m_c\) moving on a frictionless track, with a pole of mass \(m_p\) and total length \(2\ell\), hinged at the cart. The physical parameters used throughout are:

\[ \begin{aligned} g &= 9.8 \ \text{m/s}^2 \\ m_c &= 1.0 \ \text{kg} \\ m_p &= 0.1 \ \text{kg} \\ m_{\text{tot}} &= m_c + m_p = 1.1 \ \text{kg} \\ \ell &= 0.5 \ \text{m} \\ m_p \ell &= 0.05\ \text{kg·m} \end{aligned} \]

The parameters that control our data generator are:

\[ \begin{aligned} \text{max force:} \qquad F_{\max} &= 5 \ \text{N}\\ \text{max position:} \qquad x_{\max} &= 0.2 \ \text{m} \\ \text{max angle:} \qquad \theta_{\max} &= 12^\circ \approx 0.2094\ \text{rad} \\ \text{step size:} \qquad \quad \,\, \tau &= 0.01\ \text{s} \\ \end{aligned} \]

The physical constants (gravity, masses, and pole length) match widely used implementations of the CartPole environment, while the force limit, position bound, and step size are adjusted for this data generator.
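For reference, these constants can be collected into a single R list. This is only an illustrative container (the object and field names below are my own choice); the step function in Section 8 hard-codes the same values.

  # Illustrative container for the constants above
  cartpole_params <- list(
    g         = 9.8,              # gravity (m/s^2)
    m_c       = 1.0,              # cart mass (kg)
    m_p       = 0.1,              # pole mass (kg)
    l         = 0.5,              # pole half-length (m)
    F_max     = 5,                # maximum force magnitude (N)
    x_max     = 0.2,              # cart position bound (m)
    theta_max = 12 * (pi / 180),  # pole angle bound (rad)
    tau       = 0.01              # integration step size (s)
  )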

3 State space

The state at time \(t\) is

\[ S_t = (x_t, \dot x_t, \theta_t, \dot\theta_t) \in \mathbb{R}^4, \]

where

  • \(x_t\): cart horizontal position (meters),
  • \(\dot x_t\): cart velocity (m/s),
  • \(\theta_t\): pole angle relative to upright (radians),
  • \(\dot\theta_t\): pole angular velocity (rad/s).

Episodes start from small random perturbations around the upright equilibrium:

\[ (x_0, \dot x_0, \theta_0, \dot\theta_0) \sim \text{Uniform}([-0.05, 0.05]^4). \]

4 Action space and applied force

The environment accepts either discrete or continuous actions, both interpreted on the interval \([-1, 1]\).

  • Discrete actions: \[ A_t \in \{-1, +1\} \] representing a full-magnitude push to the left or right.

  • Continuous actions: \[ A_t \in [-1, +1] \]

In all cases, the horizontal force applied to the cart is

\[ F_t = A_t \, F_{\max}, \qquad |F_t| \le F_{\max}. \]

Thus discrete actions are a special case of the continuous control range.
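As a minimal illustration of this mapping (the helper name force_from_action is my own and is not part of the environment code below):

  F_max <- 5
  force_from_action <- function(action) {
    stopifnot(abs(action) <= 1)   # reject actions outside [-1, 1]
    action * F_max
  }
  force_from_action(-1)     # discrete push to the left: -5 N
  force_from_action(0.3)    # continuous action:          1.5 N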

5 Reward function

Two reward structures are considered; a standalone helper sketch of both follows the list.

  1. Survival-type reward:

    \[ R_t = \begin{cases} 1, & \text{if } |x_{t+1}| \le 0.1 \text{ and } |\theta_{t+1}| \le 0.1, \\ 0, & \text{otherwise}. \end{cases} \]

    This encourages policies that keep the system within safe bounds for as many steps as possible.

  2. Quadratic shaping reward:

    \[ R_t = - \left( 0.01 \, x_{t+1}^2 + 100 \, \theta_{t+1}^2 + 0.01 \, \dot x_{t+1}^2 + 0.1 \, \dot\theta_{t+1}^2 \right). \]

    This penalizes deviation from the upright equilibrium in both position and velocity.
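A standalone sketch of both reward functions, evaluated on the next state (the helper name cartpole_reward is my own; the step function in Section 8 embeds the same logic):

  # Sketch: reward computed from the next state (x, x_dot, theta, theta_dot)
  cartpole_reward <- function(next_state, method = "discrete") {
    x         <- next_state[1]
    x_dot     <- next_state[2]
    theta     <- next_state[3]
    theta_dot <- next_state[4]
    if (method == "discrete") {
      # survival-type reward: 1 while within the 0.1 bounds
      as.numeric(abs(x) <= 0.1 && abs(theta) <= 0.1)
    } else {
      # quadratic shaping reward: penalize deviation from upright equilibrium
      - (0.01 * x^2 + 100 * theta^2 + 0.01 * x_dot^2 + 0.1 * theta_dot^2)
    }
  }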

6 Termination conditions

An episode terminates at the first time step \(t\) for which

\[ |x_t| > x_{\max} \quad\text{or}\quad |\theta_t| > \theta_{\max}. \]

In practice, an additional maximum horizon \(T_{\max}\) (for example 200 or 500) may be imposed, in which case an episode also terminates if \(t \ge T_{\max}\).
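A minimal sketch of the bound check (the helper name is_terminal is my own; the horizon \(T_{\max}\) is handled separately by the data generator in Section 8):

  # TRUE once the cart position or pole angle leaves the allowed region
  is_terminal <- function(state, x_max = 0.2, theta_max = 12 * (pi / 180)) {
    abs(state[1]) > x_max || abs(state[3]) > theta_max
  }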

7 Transition Dynamics

Given the current state

\[ (x, \dot x, \theta, \dot\theta) \]

and the applied force \(F\), define

\[ \text{temp} = \frac{F + (m_p \ell)\,\dot\theta^{2}\,\sin(\theta)}{m_{\text{tot}}}. \]

The angular acceleration of the pole and the horizontal acceleration of the cart are

\[ \theta_{\text{acc}} = \frac{ g\,\sin(\theta) - \cos(\theta)\,\text{temp} }{ \ell\left( \frac{4}{3} - \frac{m_p \cos^{2}(\theta)}{m_{\text{tot}}} \right) }, \]

\[ x_{\text{acc}} = \text{temp} - \frac{m_p \ell\, \theta_{\text{acc}}\, \cos(\theta)}{m_{\text{tot}}}. \]

These nonlinear equations follow from Newtonian mechanics applied to the coupled cart-pole system. We discretize the dynamics using a fixed time step \(\tau = 0.01\) and semi-implicit (symplectic) Euler integration:

\[ \begin{aligned} \dot x_{t+1} &= \dot x_t + \tau\, x_{\text{acc},t}, \\ x_{t+1} &= x_t + \tau\, \dot x_{t+1}, \\ \dot\theta_{t+1} &= \dot\theta_t + \tau\, \theta_{\text{acc},t}, \\ \theta_{t+1} &= \theta_t + \tau\, \dot\theta_{t+1}. \end{aligned} \]

The mapping \((S_t, A_t) \mapsto S_{t+1}\) is deterministic.

8 Simulation environment and step function

The following R code implements the CartPole dynamics as a single-step transition function that advances the episode by one step:

  # CartPole environment with actions in [-1, 1]
  cartpole_step <- function(state, action, 
                            reward.method = "discrete") {
    
    # Unpack state
    x         <- state[1]
    x_dot     <- state[2]
    theta     <- state[3]
    theta_dot <- state[4]
    
    # Parameters
    g  <- 9.8
    m_c <- 1.0
    m_p <- 0.1
    l   <- 0.5
    F_max <- 5
    tau <- 0.01
    x_th <- 0.2
    theta_th <- 12 * (pi / 180)
    
    total_mass <- m_c + m_p
    
    # Action in [-1, 1]; discrete {-1, +1} is a special case
    f <- action * F_max
    
    # Compute accelerations
    costheta <- cos(theta)
    sintheta <- sin(theta)
    
    temp <- (f + m_p * l * theta_dot^2 * sintheta) / total_mass
    
    theta_acc <- (g * sintheta - costheta * temp) /
                 (l * (4/3 - (m_p * costheta^2) / total_mass))
    
    x_acc <- temp - (m_p * l * theta_acc * costheta) / total_mass
    
    # Update state using semi-implicit Euler
    x_dot_new     <- x_dot + tau * x_acc
    x_new         <- x + tau * x_dot_new
    theta_dot_new <- theta_dot + tau * theta_acc
    theta_new     <- theta + tau * theta_dot_new
    
    new_state <- c(x_new, x_dot_new, theta_new, theta_dot_new)
    
    # Compute termination
    done <- (abs(x_new) > x_th) ||
            (abs(theta_new) > theta_th)
    
    # Compute reward
    if (reward.method == "discrete") {
      reward <- ifelse(abs(theta_new) > 0.1 | abs(x_new) > 0.1, 0, 1)
    } else if (reward.method == "continuous") {
      reward <- - ( 0.01 * x_new^2 + # position loss
                    100 * theta_new^2 + # angle loss
                    0.01 * x_dot_new^2 + # position stability 
                    0.1 * theta_dot_new^2) # angle stability
    } else {
      stop("Unknown reward.method")
    }
    
    return(list(state = new_state, reward = reward, done = done))
  }
  
  # Function to initialize the cartpole state
  cartpole_initial <- function() {
    state <- runif(4, min = -0.05, max = 0.05)
    return(state)
  }
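
As a quick usage example, here is a single environment step from a random initial state (the seed and the full push to the right are arbitrary choices):

  set.seed(1)
  s0  <- cartpole_initial()
  out <- cartpole_step(s0, action = 1, reward.method = "discrete")
  out$state    # next state (x, x_dot, theta, theta_dot)
  out$reward   # 1 while within the 0.1 bounds, 0 otherwise
  out$done     # TRUE only if the termination bounds are violated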

We can then generate a dataset of such transitions for use in offline reinforcement learning experiments. For this example, we will use a random policy that samples actions from the discrete set \(\{-1, +1\}\).

  generate_cartpole <- function(n_episode = 1,
                                T_max = 200,
                                reward.method = "discrete",
                                policy = function(state) {
                                  sample(c(-1, 1), 1)
                                }) {

    all_rows <- list()    # store rows before binding
    row_id <- 1
    
    for (ep in 1:n_episode) {
      
      state <- cartpole_initial()
      
      for (t in 1:T_max) {
        
        # Generate action
        A_t <- policy(state)
        
        # check action range
        if (abs(A_t) > 1) {
          stop("Action out of range [-1, 1]")
        }
        
        # Step the environment
        out <- cartpole_step(state, A_t, reward.method = reward.method)
        
        S_next <- out$state
        R_t    <- out$reward
        done   <- out$done
        
        # Record (S_t, A_t, R_t)
        all_rows[[row_id]] <- data.frame(
          episode = ep,
          time    = t,
          x       = state[1],
          x_dot   = state[2],
          theta   = state[3],
          theta_dot = state[4],
          action  = A_t,
          reward  = R_t
        )
        row_id <- row_id + 1
        
        # Move to next state
        state <- S_next
        
        if (done) break
      }
    }
    
    # Combine into a dataframe
    batch_data <- do.call(rbind, all_rows)
    rownames(batch_data) <- NULL
    return(batch_data)
  }
  
  set.seed(546)
  cartpole_data <- generate_cartpole()
  
  head(cartpole_data)
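
Since the default call generates a single episode, the number of rows equals that episode's length (at most T_max = 200); with several episodes, tabulating by episode gives the individual lengths.

  nrow(cartpole_data)            # length of the single generated episode
  table(cartpole_data$episode)   # episode lengths when n_episode > 1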

To visualize the generated data, we can plot the cart position and pole angle over time for a few episodes.

  # Generate 10 episodes of discrete {-1, +1} data
  set.seed(432)
  cartpole_data <- generate_cartpole(n_episode = 10)
  
  # plot the trajectories of the cart position and rewards
  library(ggplot2)
  library(patchwork)
  
  plot_cartpole <- function(data) {
      
    p1 <- ggplot(data, aes(x = time, y = x, color = factor(episode))) +
      geom_line() +
      labs(title = "Cart Position (x)", x = "Time", y = "x") +
      theme_minimal() +
      scale_y_continuous(limits = c(-0.2, 0.2)) + 
      theme(legend.position = "none")
    
    p2 <- ggplot(data, aes(x = time, y = x_dot, color = factor(episode))) +
      geom_line() +
      labs(title = "Cart Velocity (x_dot)", x = "Time", y = "x_dot") +
      theme_minimal() +
      theme(legend.position = "none")
    
    p3 <- ggplot(data, aes(x = time, y = theta, color = factor(episode))) +
      geom_line() +
      labs(title = "Pole Angle (theta)", x = "Time", y = "theta") +
      theme_minimal() +
      scale_y_continuous(limits = c(-12 * (pi / 180), 12 * (pi / 180))) + 
      theme(legend.position = "none")
    
    p4 <- ggplot(data, aes(x = time, y = theta_dot, color = factor(episode))) +
      geom_line() +
      labs(title = "Pole Angular Velocity (theta_dot)", x = "Time", y = "theta_dot") +
      theme_minimal() +
      theme(legend.position = "none")
    
    p5 <- ggplot(data, aes(x = time, y = reward, color = factor(episode))) +
      geom_line() +
      labs(title = "Reward", x = "Time", y = "reward") +
      theme_minimal() +
      theme(legend.position = "none")
    
    # 2x2 grid of state variables, with the reward panel on a third row
    (p1 | p2) /
    (p3 | p4) /
    p5
  }
  
  plot_cartpole(cartpole_data)
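
The generator stores only \((S_t, A_t, R_t)\) per row. For offline methods such as Fitted Q-Iteration, each transition also needs the next state. The sketch below pairs consecutive rows within each episode to form \((S_t, A_t, R_t, S_{t+1})\) tuples; the helper name make_transitions is my own choice. The last recorded row of each episode is dropped because its successor state (possibly the terminal one) is not stored by generate_cartpole.

  # Sketch: build (S_t, A_t, R_t, S_{t+1}) tuples by pairing consecutive rows
  make_transitions <- function(data) {
    per_episode <- lapply(split(data, data$episode), function(ep) {
      n <- nrow(ep)
      if (n < 2) return(NULL)        # need at least two rows to form a pair
      cbind(ep[1:(n - 1), ],
            x_next         = ep$x[2:n],
            x_dot_next     = ep$x_dot[2:n],
            theta_next     = ep$theta[2:n],
            theta_dot_next = ep$theta_dot[2:n])
    })
    out <- do.call(rbind, per_episode)
    rownames(out) <- NULL
    return(out)
  }

  transition_data <- make_transitions(cartpole_data)
  head(transition_data)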