\(\newcommand{\ci}{\perp\!\!\!\perp}\) \(\newcommand{\cA}{\mathcal{A}}\) \(\newcommand{\cB}{\mathcal{B}}\) \(\newcommand{\cC}{\mathcal{C}}\) \(\newcommand{\cD}{\mathcal{D}}\) \(\newcommand{\cE}{\mathcal{E}}\) \(\newcommand{\cF}{\mathcal{F}}\) \(\newcommand{\cG}{\mathcal{G}}\) \(\newcommand{\cH}{\mathcal{H}}\) \(\newcommand{\cI}{\mathcal{I}}\) \(\newcommand{\cJ}{\mathcal{J}}\) \(\newcommand{\cK}{\mathcal{K}}\) \(\newcommand{\cL}{\mathcal{L}}\) \(\newcommand{\cM}{\mathcal{M}}\) \(\newcommand{\cN}{\mathcal{N}}\) \(\newcommand{\cO}{\mathcal{O}}\) \(\newcommand{\cP}{\mathcal{P}}\) \(\newcommand{\cQ}{\mathcal{Q}}\) \(\newcommand{\cR}{\mathcal{R}}\) \(\newcommand{\cS}{\mathcal{S}}\) \(\newcommand{\cT}{\mathcal{T}}\) \(\newcommand{\cU}{\mathcal{U}}\) \(\newcommand{\cV}{\mathcal{V}}\) \(\newcommand{\cW}{\mathcal{W}}\) \(\newcommand{\cX}{\mathcal{X}}\) \(\newcommand{\cY}{\mathcal{Y}}\) \(\newcommand{\cZ}{\mathcal{Z}}\) \(\newcommand{\bA}{\mathbf{A}}\) \(\newcommand{\bB}{\mathbf{B}}\) \(\newcommand{\bC}{\mathbf{C}}\) \(\newcommand{\bD}{\mathbf{D}}\) \(\newcommand{\bE}{\mathbf{E}}\) \(\newcommand{\bF}{\mathbf{F}}\) \(\newcommand{\bG}{\mathbf{G}}\) \(\newcommand{\bH}{\mathbf{H}}\) \(\newcommand{\bI}{\mathbf{I}}\) \(\newcommand{\bJ}{\mathbf{J}}\) \(\newcommand{\bK}{\mathbf{K}}\) \(\newcommand{\bL}{\mathbf{L}}\) \(\newcommand{\bM}{\mathbf{M}}\) \(\newcommand{\bN}{\mathbf{N}}\) \(\newcommand{\bO}{\mathbf{O}}\) \(\newcommand{\bP}{\mathbf{P}}\) \(\newcommand{\bQ}{\mathbf{Q}}\) \(\newcommand{\bR}{\mathbf{R}}\) \(\newcommand{\bS}{\mathbf{S}}\) \(\newcommand{\bT}{\mathbf{T}}\) \(\newcommand{\bU}{\mathbf{U}}\) \(\newcommand{\bV}{\mathbf{V}}\) \(\newcommand{\bW}{\mathbf{W}}\) \(\newcommand{\bX}{\mathbf{X}}\) \(\newcommand{\bY}{\mathbf{Y}}\) \(\newcommand{\bZ}{\mathbf{Z}}\) \(\newcommand{\ba}{\mathbf{a}}\) \(\newcommand{\bb}{\mathbf{b}}\) \(\newcommand{\bc}{\mathbf{c}}\) \(\newcommand{\bd}{\mathbf{d}}\) \(\newcommand{\be}{\mathbf{e}}\) \(\newcommand{\bg}{\mathbf{g}}\) \(\newcommand{\bh}{\mathbf{h}}\) \(\newcommand{\bi}{\mathbf{i}}\) \(\newcommand{\bj}{\mathbf{j}}\) \(\newcommand{\bk}{\mathbf{k}}\) \(\newcommand{\bl}{\mathbf{l}}\) \(\newcommand{\bm}{\mathbf{m}}\) \(\newcommand{\bn}{\mathbf{n}}\) \(\newcommand{\bo}{\mathbf{o}}\) \(\newcommand{\bp}{\mathbf{p}}\) \(\newcommand{\bq}{\mathbf{q}}\) \(\newcommand{\br}{\mathbf{r}}\) \(\newcommand{\bs}{\mathbf{s}}\) \(\newcommand{\bt}{\mathbf{t}}\) \(\newcommand{\bu}{\mathbf{u}}\) \(\newcommand{\bv}{\mathbf{v}}\) \(\newcommand{\bw}{\mathbf{w}}\) \(\newcommand{\bx}{\mathbf{x}}\) \(\newcommand{\by}{\mathbf{y}}\) \(\newcommand{\bz}{\mathbf{z}}\) \(\newcommand{\RR}{\mathbb{R}}\) \(\newcommand{\NN}{\mathbb{N}}\) \(\newcommand{\balpha}{\boldsymbol{\alpha}}\) \(\newcommand{\bbeta}{\boldsymbol{\beta}}\) \(\newcommand{\btheta}{\boldsymbol{\theta}}\) \(\newcommand{\hpi}{\widehat{\pi}}\) \(\newcommand{\bpi}{\boldsymbol{\pi}}\) \(\newcommand{\hbpi}{\widehat{\boldsymbol{\pi}}}\) \(\newcommand{\bxi}{\boldsymbol{\xi}}\) \(\newcommand{\bmu}{\boldsymbol{\mu}}\) \(\newcommand{\bepsilon}{\boldsymbol{\epsilon}}\) \(\newcommand{\bzero}{\mathbf{0}}\) \(\newcommand{\T}{\text{T}}\) \(\newcommand{\Trace}{\text{Trace}}\) \(\newcommand{\Cov}{\text{Cov}}\) \(\newcommand{\Var}{\text{Var}}\) \(\newcommand{\E}{\mathbb{E}}\) \(\newcommand{\Pr}{\text{Pr}}\) \(\newcommand{\pr}{\text{pr}}\) \(\newcommand{\pdf}{\text{pdf}}\) \(\newcommand{\P}{\text{P}}\) \(\newcommand{\p}{\text{p}}\) \(\newcommand{\One}{\mathbf{1}}\) \(\newcommand{\argmin}{\operatorname*{arg\,min}}\) 
\(\newcommand{\argmax}{\operatorname*{arg\,max}}\) \(\newcommand{\dtheta}{\frac{\partial}{\partial\theta} }\) \(\newcommand{\ptheta}{\nabla_\theta}\) \(\newcommand{\alert}[1]{\color{darkorange}{#1}}\) \(\newcommand{\alertr}[1]{\color{red}{#1}}\) \(\newcommand{\alertb}[1]{\color{blue}{#1}}\)

1 Fitted Q-Iteration

Up to this point, there are several practical issues that we have not yet dealt with:

  • When the state space (and possibly the action space) is large or continuous, it is infeasible to represent the value function or Q-function exactly. We need to use function approximation to represent elements in the Bellman equation.
  • In many practical scenarios, the transition dynamics are unknown and may be difficult to estimate. Can we apply the Bellman operator without estimating the transition kernel?1
  • In most medical applications (and many other human-related studies), we cannot interact with the environment to collect new data. Instead, we only have a fixed batch of historical data. How can we perform policy evaluation and policy learning in this offline setting? This is essentially a causal inference problem: we need to estimate the value (or treatment effect) of a new policy \(\pi^\ast\) using data collected under a behavior policy \(\mu^b(\cdot \mid s)\).

A very popular algorithm that addresses all of these challenges is the Fitted Q-Iteration algorithm, proposed by Ernst, Geurts, and Wehenkel (2005). The key idea is to approximate the Q-function with a regression model (e.g., random forests, kernel methods, or neural networks) and to update this estimate iteratively using targets constructed from the Bellman optimality equation. The algorithm proceeds as follows:

  1. Initialize the Q-function estimate \(\hat{Q}_0(s,a)\). One may simply set it to zero or fit a regression of \(r_i\) on \((s_i,a_i)\) using the batch data.
  2. For each iteration \(k = 0, 1, 2, \ldots\):
    1. For each data point \((s_i, a_i, r_i, s'_i)\) in the dataset:
      • Compute the bootstrapped target \[ y_i = r_i + \gamma \max_{a'} \hat{Q}_k(s'_i, a'). \]
    2. Fit a regression model mapping \((s_i,a_i)\) to \(y_i\) to obtain the new estimate \(\hat{Q}_{k+1}(s,a)\).
  3. Repeat until convergence.

In this algorithm, the target value corresponds to the random quantity (due to the randomness of the transition and the reward) inside the expectation of the Bellman optimality operator:

\[ \begin{aligned} \cT_Q^\ast Q(s,a) &= r(s,a) + \gamma \E_{S' \sim p(\cdot \mid s,a)} \left[ \max_{a'} Q(S', a') \right] \\ &= \E_{R, S' \sim p(\cdot \mid s,a)} \left[ R + \gamma \max_{a'} Q(S', a') \right]\\ &= \E \left[ Y \mid S=s, A=a \right], \end{aligned} \]

where \(Y = R + \gamma \max_{a'} Q(S', a')\) denotes the random target constructed from a single transition.

Since we only observe samples \((r_i, s'_i)\), the empirical target \(y_i\) provides a sample-based approximation to the Bellman optimality backup.
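
As a minimal illustration (a sketch, not part of the algorithm itself), the target for a single transition \((s_i, a_i, r_i, s'_i)\) with a finite action set can be computed as follows; here `predict_Q` is a hypothetical placeholder for the current estimate \(\hat{Q}_k\), and the numerical values are arbitrary.

  # Sketch: one bootstrapped target. `predict_Q` is a hypothetical placeholder
  # for the current estimate Q_k; it is set to zero here only to be runnable.
  predict_Q <- function(s, a) 0
  actions   <- c(-1, 1)            # finite action set
  gamma     <- 0.9                 # discount factor
  r_i       <- 1                   # observed reward (arbitrary illustrative value)
  s_prime_i <- c(0, 0, 0.01, 0)    # observed next state (arbitrary illustrative value)
  y_i <- r_i + gamma * max(sapply(actions, function(a) predict_Q(s_prime_i, a)))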

2 CartPole Example

The CartPole environment is a classic benchmark problem in reinforcement learning. It involves balancing a pole on a moving cart by applying forces to the cart. The state space is continuous, and the action space can be either discrete (e.g., left or right force) or continuous (e.g., varying force magnitude). We have created an environment simulator in R. The details of the environment dynamics and parameters can be found in our cartpole document.

  # CartPole environment with actions in [-1, 1]
  cartpole_step <- function(state, action, 
                            reward.method = "discrete") {
    
    # Unpack state
    x         <- state[1]
    x_dot     <- state[2]
    theta     <- state[3]
    theta_dot <- state[4]
    
    # Parameters
    g  <- 9.8
    m_c <- 1.0
    m_p <- 0.1
    l   <- 0.5
    F_max <- 5
    tau <- 0.01
    x_th <- 0.2
    theta_th <- 12 * (pi / 180)
    
    total_mass <- m_c + m_p
    
    # Action in [-1, 1]; discrete {-1, +1} is a special case
    f <- action * F_max
    
    # Compute accelerations
    costheta <- cos(theta)
    sintheta <- sin(theta)
    
    temp <- (f + m_p * l * theta_dot^2 * sintheta) / total_mass
    
    theta_acc <- (g * sintheta - costheta * temp) /
                 (l * (4/3 - (m_p * costheta^2) / total_mass))
    
    x_acc <- temp - (m_p * l * theta_acc * costheta) / total_mass
    
    # Update state using semi-implicit Euler
    x_dot_new     <- x_dot + tau * x_acc
    x_new         <- x + tau * x_dot_new
    theta_dot_new <- theta_dot + tau * theta_acc
    theta_new     <- theta + tau * theta_dot_new
    
    new_state <- c(x_new, x_dot_new, theta_new, theta_dot_new)
    
    # Compute termination
    done <- (abs(x_new) > x_th) ||
            (abs(theta_new) > theta_th)
    
    # Compute reward
    if (reward.method == "discrete") {
      reward <- ifelse(abs(theta_new) > 0.1 | abs(x_new) > 0.1, 0, 1)
    } else if (reward.method == "continuous") {
      reward <- - ( 0.01 * x_new^2 + # position loss
                    100 * theta_new^2 + # angle loss
                    0.01 * x_dot_new^2 + # position stability 
                    0.1 * theta_dot_new^2) # angle stability
    } else {
      stop("Unknown reward.method")
    }
    
    return(list(state = new_state, reward = reward, done = done))
  }
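  
  # Illustrative sanity check (not part of the environment definition): a single
  # deterministic step from a near-upright state with a push to the right
  check_step <- cartpole_step(c(0, 0, 0.01, 0), action = 1)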
  
  # Function to initialize the cartpole state
  cartpole_initial <- function() {
    state <- runif(4, min = -0.05, max = 0.05)
    return(state)
  }
  
  # function to generate episodes
  generate_cartpole <- function(n_episode = 1,
                                T_max = 200,
                                reward.method = "discrete",
                                policy = function(state) {
                                  sample(c(-1, 1), 1)
                                }) {

    all_rows <- list()    # store rows before binding
    row_id <- 1
    
    for (ep in 1:n_episode) {
      
      state <- cartpole_initial()
      
      for (t in 1:T_max) {
        
        # Generate action
        A_t <- policy(state)
        
        # check action range
        if (abs(A_t) > 1) {
          stop("Action out of range [-1, 1]")
        }
        
        # Step the environment
        out <- cartpole_step(state, A_t, reward.method = reward.method)
        
        S_next <- out$state
        R_t    <- out$reward
        done   <- out$done
        
        # Record (S_t, A_t, R_t)
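        # (the next state S_{t+1} is not stored explicitly; it is the state
        #  recorded at time t+1 of the same episode, which the fitted
        #  Q-iteration code below relies on)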
        all_rows[[row_id]] <- data.frame(
          episode = ep,
          time    = t,
          x       = state[1],
          x_dot   = state[2],
          theta   = state[3],
          theta_dot = state[4],
          action  = A_t,
          reward  = R_t
        )
        row_id <- row_id + 1
        
        # Move to next state
        state <- S_next
        
        if (done) break
      }
    }
    
    # Combine into a dataframe
    batch_data <- do.call(rbind, all_rows)
    rownames(batch_data) <- NULL
    return(batch_data)
  }
  
  set.seed(546)
  cartpole_data = generate_cartpole(n_episode = 200, reward.method = "continuous")
  # plot the trajectories of the cart position and rewards
  library(ggplot2)
  library(patchwork)
  
  plot_cartpole <- function(data) {
      
    p1 <- ggplot(data, aes(x = time, y = x, color = factor(episode))) +
      geom_line() +
      labs(title = "Cart Position (x)", x = "Time", y = "x") +
      theme_minimal() +
      scale_y_continuous(limits = c(-0.2, 0.2)) + 
      theme(legend.position = "none")
    
    p2 <- ggplot(data, aes(x = time, y = x_dot, color = factor(episode))) +
      geom_line() +
      labs(title = "Cart Velocity (x_dot)", x = "Time", y = "x_dot") +
      theme_minimal() +
      theme(legend.position = "none")
    
    p3 <- ggplot(data, aes(x = time, y = theta, color = factor(episode))) +
      geom_line() +
      labs(title = "Pole Angle (theta)", x = "Time", y = "theta") +
      theme_minimal() +
      scale_y_continuous(limits = c(-12 * (pi / 180), 12 * (pi / 180))) + 
      theme(legend.position = "none")
    
    p4 <- ggplot(data, aes(x = time, y = theta_dot, color = factor(episode))) +
      geom_line() +
      labs(title = "Pole Angular Velocity (theta_dot)", x = "Time", y = "theta_dot") +
      theme_minimal() +
      theme(legend.position = "none")
    
    p5 <- ggplot(data, aes(x = time, y = reward, color = factor(episode))) +
      geom_line() +
      labs(title = "Reward", x = "Time", y = "reward") +
      theme_minimal() +
      theme(legend.position = "none")
    
    # layout: 2x2 grid of state panels, with the reward panel below
    (p1 | p2) /
    (p3 | p4) /
    p5
  }
  
  plot_cartpole(cartpole_data)

3 Fitted Q-Iteration With Linear Approximations

The original fitted Q-iteration algorithm was proposed with machine learning models such as random forests. Since the CartPole example is rather simple, we will first implement it with a linear function approximation.

  # Initial Q-function estimate: fit reward ~ state + action
  linear_formula = y_targets ~ x + x_dot + theta + theta_dot + action + 
                               action*x + action*x_dot + action*theta + action*theta_dot
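  # (in R, a term such as `action*x` expands to `action + x + action:x`, so the
  #  state main effects and all state-action interactions are included)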

  y_targets = cartpole_data$reward
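  # note: `y_targets` is not a column of the data frame, so lm() resolves it
  # from the formula's environment; the same mechanism is used inside the loop below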
  fit_qliner <- lm(linear_formula, data = cartpole_data)
  
  # Settings
  n_iter <- 100
  gamma <- 0.9
  
  # Start to iterate
  for (iter in 1:n_iter) {
    
    # Prepare data for next states with both actions
    Data_Next = cartpole_data[cartpole_data$time > 1, ]
    Last_T = cartpole_data$time == ave(cartpole_data$time, cartpole_data$episode, FUN = max)
    Data_Current = cartpole_data[!Last_T, ]
    
    Data_Next_A1 = Data_Next
    Data_Next_A1$action = -1
    
    Data_Next_A2 = Data_Next
    Data_Next_A2$action = 1
    
    # Predict Q-values for next states with both actions
    Q_next_A1 <- predict(fit_qliner, newdata = Data_Next_A1)
    Q_next_A2 <- predict(fit_qliner, newdata = Data_Next_A2)
    
    # Compute max Q-value for next states
    Q_next_max <- pmax(Q_next_A1, Q_next_A2)

    # Compute bootstrapped targets
    y_targets <- Data_Current$reward + gamma * Q_next_max
    
    # Fit new regression model
    fit_qliner <- lm(linear_formula, data = Data_Current)
  }
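
As an optional diagnostic, we can sketch how the fitted linear Q-function ranks the two actions along a grid of pole angles, holding the other state variables at zero; the grid range and the choice to fix the remaining coordinates at zero are arbitrary.

  # Optional diagnostic sketch: preference for action +1 over action -1 as a
  # function of the pole angle, with the other state variables held at zero
  theta_grid <- seq(-0.2, 0.2, length.out = 50)
  grid_df <- data.frame(x = 0, x_dot = 0, theta = theta_grid, theta_dot = 0)
  Q_right <- predict(fit_qliner, newdata = transform(grid_df, action = 1))
  Q_left  <- predict(fit_qliner, newdata = transform(grid_df, action = -1))
  plot(theta_grid, Q_right - Q_left, type = "l",
       xlab = "theta", ylab = "Q(s, +1) - Q(s, -1)")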

Based on this fitted linear Q-function, we can derive a greedy policy by selecting the action that maximizes the Q-value at each state. This essentially plugs the state into the fitted Q-function and chooses the action accordingly.

  # Derive greedy policy from fitted Q-function
  greedy_policy <- function(state, Qfun) {
    
    state_df <- data.frame(
      x = state[1],
      x_dot = state[2],
      theta = state[3],
      theta_dot = state[4]
    )
    
    # Q-value for action = -1
    state_df$action <- -1
    Q_A1 <- predict(Qfun, newdata = state_df)
    
    # Q-value for action = +1
    state_df$action <- 1
    Q_A2 <- predict(Qfun, newdata = state_df)
    
    if (Q_A1 > Q_A2) {
      return(-1)
    } else {
      return(1)
    }
  }

We can now use this function to generate new episodes under the learned policy and evaluate its performance.

  # Evaluate the learned policy
  new_policy <- function(state) {
    greedy_policy(state, Qfun = fit_qliner)
  }
  
  # Generate new episodes using the learned policy
  test_data <- generate_cartpole(n_episode = 50,
                                 reward.method = "continuous",
                                 policy = new_policy)
  
  # plot the trajectories of the cart position
  plot_cartpole(test_data)

This is probably not much better than a random policy, since the linear function approximation is too simple for this problem. Recall our earlier theoretical discussion of Q-function error propagation in dynamic treatment regimes: errors can accumulate over iterations, and when the function approximation is poor, fitted Q-iteration may not converge to a good policy. Most of the trajectories still end with the pole falling quickly. Next, we try a quadratic function approximation.
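
To make the error-propagation point concrete, a standard sup-norm bound for approximate value iteration (stated informally here, with \(\epsilon_k\) denoting the worst-case regression error incurred at iteration \(k\)) is

\[ \big\| \hat{Q}_K - Q^\ast \big\|_\infty \;\le\; \gamma^K \big\| \hat{Q}_0 - Q^\ast \big\|_\infty + \sum_{k=0}^{K-1} \gamma^{K-1-k} \epsilon_k, \]

so the effect of the initialization vanishes geometrically, while a persistent approximation error of size \(\epsilon\) can contribute up to \(\epsilon/(1-\gamma)\) to the final error.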

4 Fitted Q-Iteration With Quadratic Approximations

To learn a better Q-function, we can try a more flexible approximation. Here we use a quadratic function of the state variables, together with state-action interactions.

  # Initial Q-function estimate: fit reward ~ state + action
  quad_formula = y_targets ~ I(x^2) + I(x_dot^2) + I(theta^2) + I(theta_dot^2) + action + 
                          action*x + action*x_dot + action*theta + action*theta_dot
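  # (I() is needed so that ^2 is interpreted as squaring rather than as
  #  formula crossing syntax)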

  y_targets = cartpole_data$reward
  fit_qquad <- lm(quad_formula, data = cartpole_data)
  
  # Settings
  n_iter <- 100
  gamma <- 0.9
  
  # Start to iterate
  for (iter in 1:n_iter) {
    
    # Prepare data for next states with both actions
    Data_Next = cartpole_data[cartpole_data$time > 1, ]
    Last_T = cartpole_data$time == ave(cartpole_data$time, cartpole_data$episode, FUN = max)
    Data_Current = cartpole_data[!Last_T, ]
    
    Data_Next_A1 = Data_Next
    Data_Next_A1$action = -1
    
    Data_Next_A2 = Data_Next
    Data_Next_A2$action = 1
    
    # Predict Q-values for next states with both actions
    Q_next_A1 <- predict(fit_qquad, newdata = Data_Next_A1)
    Q_next_A2 <- predict(fit_qquad, newdata = Data_Next_A2)
    
    # Compute max Q-value for next states
    Q_next_max <- pmax(Q_next_A1, Q_next_A2)

    # Compute bootstrapped targets
    y_targets <- Data_Current$reward + gamma * Q_next_max
    
    # Fit new regression model
    fit_qquad <- lm(quad_formula, data = Data_Current)
  }

As before, we can implement the greedy policy based on the fitted quadratic Q-function.

  # Evaluate the learned policy
  learned_policy <- function(state) {
    greedy_policy(state, Qfun = fit_qquad)
  }
  
  # Generate new episodes using the learned policy
  test_data <- generate_cartpole(n_episode = 50,
                                 reward.method = "continuous",
                                 policy = learned_policy)
  
  # plot the trajectories of the cart position
  plot_cartpole(test_data)

5 Using Random Forests

We now use the ranger package to implement fitted Q-iteration on the generated batch data. In the first step, we fit a regression model to predict the reward based on the current state and action. Note that the same function-approximation issues can arise here, so the random forest needs to be tuned properly. For this example, we use the discrete reward, which is slightly more difficult.

  # generate discrete action data
  set.seed(546)
  cartpole_data = generate_cartpole(n_episode = 200, reward.method = "discrete")

  # plot the trajectories of the cart position
  plot_cartpole(cartpole_data)

  
  # Initial Q-function estimate: fit reward ~ state + action
  library(ranger)
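  # a relatively large min.node.size gives smoother forest predictions, which
  # helps stabilize the regression targets across iterations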
  nodesize = 100
  ntrees = 500
  
  rf_q <- ranger(
    formula = reward ~ x + x_dot + theta + theta_dot + action,
    data = cartpole_data,
    num.trees = ntrees,
    min.node.size = nodesize
  )

To implement the fitted Q-iteration algorithm, we do the same as before, but now we use random forests as the function approximation.

  # get all data without the first time step in each episode
  Data_Next = cartpole_data[cartpole_data$time > 1, ]

  # get all data without the last time step in each episode    
  Last_T = cartpole_data$time == ave(cartpole_data$time, cartpole_data$episode, FUN = max)
  Data_Current = cartpole_data[!Last_T, ]
  
  # define data frame with next states and action = -1
  Data_Next_A1 = Data_Next
  Data_Next_A1$action = -1
  
  # define data frame with next states and action = +1
  Data_Next_A2 = Data_Next
  Data_Next_A2$action = +1
  
  current_action = Data_Current$action
  
  # settings
  gamma = 0.9  
  n_iter = 20
  
  # start to iterate
  for (iter in 1:n_iter) {
    
    # Predict Q-values for next states with both actions
    Q_next_A1 <- predict(rf_q, data = Data_Next_A1)$predictions
    Q_next_A2 <- predict(rf_q, data = Data_Next_A2)$predictions
    
    # Compute max Q-value for next states
    Q_next_max <- pmax(Q_next_A1, Q_next_A2)
    
    # best action at next stage
    action_next <- ifelse(Q_next_A1 > Q_next_A2, -1, 1)
    
    # how many actions changed as percentage
    cat("Iteration :", iter, "| changed actions:",
        round(sum(action_next != current_action) / length(current_action) * 100, 2), "%\n")
    
    current_action <- action_next
    
    # Compute bootstrapped targets
    y_targets <- Data_Current$reward + gamma * Q_next_max
    
    # Fit new regression model
    rf_q <- ranger(
      formula = y_targets ~ x + x_dot + theta + theta_dot + action,
      data = Data_Current,
      num.trees = ntrees,
      min.node.size = nodesize
    )
  }
## Iteration : 1 | changed actions: 51.7 %
## Iteration : 2 | changed actions: 44.68 %
## Iteration : 3 | changed actions: 10.17 %
## Iteration : 4 | changed actions: 6.61 %
## Iteration : 5 | changed actions: 5.77 %
## Iteration : 6 | changed actions: 6.32 %
## Iteration : 7 | changed actions: 3.67 %
## Iteration : 8 | changed actions: 3.6 %
## Iteration : 9 | changed actions: 3.32 %
## Iteration : 10 | changed actions: 3.12 %
## Iteration : 11 | changed actions: 2.96 %
## Iteration : 12 | changed actions: 2.44 %
## Iteration : 13 | changed actions: 1.77 %
## Iteration : 14 | changed actions: 1.77 %
## Iteration : 15 | changed actions: 2.03 %
## Iteration : 16 | changed actions: 2.06 %
## Iteration : 17 | changed actions: 2.02 %
## Iteration : 18 | changed actions: 1.51 %
## Iteration : 19 | changed actions: 1.72 %
## Iteration : 20 | changed actions: 1.29 %

We can now apply this policy to see how well it performs.

  # Derive greedy policy from fitted Q-function
  greedy_policy_rf <- function(state, Qfun) {
    
    state_df <- data.frame(
      x = state[1],
      x_dot = state[2],
      theta = state[3],
      theta_dot = state[4]
    )
    
    # Q-value for action = -1
    state_df$action <- -1
    Q_A1 <- predict(Qfun, data = state_df)$predictions
    
    # Q-value for action = +1
    state_df$action <- 1
    Q_A2 <- predict(Qfun, data = state_df)$predictions
    
    if (Q_A1 > Q_A2) {
      return(-1)
    } else {
      return(1)
    }
  }
  # Evaluate the learned policy
  new_policy <- function(state) {
    greedy_policy_rf(state, Qfun = rf_q)
  }
  
  # Generate new episodes using the learned policy
  test_data <- generate_cartpole(n_episode = 50,
                                 reward.method = "discrete",
                                 policy = new_policy)
  
  # plot the trajectories of the cart position
  plot_cartpole(test_data)

We can see that most of the trajectories now last much longer, indicating that the learned policy is effective at balancing the pole. In fact, the iteration did not run until convergence, as indicated by the percentage of changed actions at each iteration; with more iterations, the performance could be improved further. We can also notice that the learned policy pays little attention to the cart position \(x\). This is mainly because, in the training samples, we rarely observe the reward dropping to 0 due to the cart position leaving its allowed range; most failures are due to the pole angle \(\theta\) exceeding its threshold. Hence, the learned policy focuses on controlling the pole angle rather than the cart position. This phenomenon is related to both extrapolation error and distributional shift in offline reinforcement learning.
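
As a rough check of this explanation, we can compare how often the behavior data comes close to the position limit (\(0.2\)) versus the angle limit (about \(0.21\) rad); the 75% cutoff below is just an arbitrary notion of "close".

  # fraction of behavior samples near each failure boundary (75% of the threshold)
  mean(abs(cartpole_data$x) > 0.75 * 0.2)
  mean(abs(cartpole_data$theta) > 0.75 * 12 * pi / 180)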


Ernst, Damien, Pierre Geurts, and Louis Wehenkel. 2005. “Tree-Based Batch Mode Reinforcement Learning.” Journal of Machine Learning Research 6.

  1. This is known as the model-free setting in reinforcement learning, where the agent learns optimal policies directly from experience without explicitly modeling the environment’s dynamics. If we estimate the transition dynamics, it is known as the model-based setting.↩︎