We use the bmiData dataset from the R package DynTxRegime. The data are generated to mimic a two-stage randomized clinical trial that studied the effect of meal replacement shakes on adolescent obesity. CD refers to conventional diet, and MR refers to meal replacements. The trial unfolds as follows: gender, race, parentBMI, and baselineBMI are observed, and the randomized treatment A1 is assigned to each patient; month4BMI is then observed, and the randomized treatment A2 is assigned; finally, month12BMI is observed. The overall reward is the percentage reduction of BMI at month 12 relative to baseline:

\[ \text{Reward} = -100*\frac{\text{month12BMI}-\text{baselineBMI}}{\text{baselineBMI}}. \]
Note that Q-learning can also be carried out by assuming that all the \(R_t\) are zero, so that only the last-stage (cumulative) reward is used. Here, to be consistent with our lecture, we calculate an intermediate reward at month 4.
library(DynTxRegime)
data(bmiData)
head(bmiData)
# Define the cumulative reward over two stages
R <- -100*(bmiData[,6] - bmiData[,4])/bmiData[,4]
# Define the reward at the first stage
R1 <- -100*(bmiData[,5] - bmiData[,4])/bmiData[,4]
# Define the reward at the second stage
R2 <- R - R1
Following our lecture notes, batch Q-learning is a backward induction procedure: we first fit the model at the second stage, then use the estimated optimal reward as the response when fitting the model at the first stage.
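The same backward recursion applies to any number of stages. Below is a minimal, self-contained sketch of the general procedure; the function name backward_qlearning, the treatment column name A, and the list-of-data-frames layout are all illustrative assumptions, not part of the bmiData analysis that follows.

```r
backward_qlearning <- function(stage_data, rewards) {
  # stage_data: list of data frames, one per stage; each contains that
  #   stage's covariates plus a treatment factor column named A
  # rewards: list of stage-wise reward vectors R_t
  K <- length(stage_data)
  fits <- vector("list", K)
  pseudo <- rewards[[K]]
  for (t in K:1) {
    dat <- stage_data[[t]]
    dat$pseudo <- pseudo
    # Q-model at stage t: main effects plus treatment interactions
    covs <- setdiff(names(stage_data[[t]]), "A")
    fml <- reformulate(c(covs, "A", paste0("A:", covs)), response = "pseudo")
    fits[[t]] <- lm(fml, data = dat)
    # predicted outcome for every patient under each treatment option
    lv <- levels(stage_data[[t]]$A)
    qmat <- sapply(lv, function(a) {
      nd <- stage_data[[t]]
      nd$A <- factor(a, levels = lv)
      predict(fits[[t]], newdata = nd)
    })
    # pseudo outcome passed back to the previous stage: R_{t-1} + max_a Q_t
    if (t > 1) pseudo <- rewards[[t - 1]] + apply(qmat, 1, max)
  }
  fits
}
```

The two-stage analysis below carries out exactly these steps by hand, which makes each intermediate object easy to inspect.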
At the second stage, we have already observed the results up to month 4. Besides the treatment assignment A2, we also have the covariates parentBMI, month4BMI, and race. We fit a model predicting the second-stage reward R2 from these covariates.
mod_sec = lm(R2 ~ parentBMI + month4BMI + A2 + A2*(race + parentBMI + month4BMI),
data=bmiData)
summary(mod_sec)
##
## Call:
## lm(formula = R2 ~ parentBMI + month4BMI + A2 + A2 * (race + parentBMI +
## month4BMI), data = bmiData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.40198 -0.08702 0.00148 0.08869 0.31937
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.778e+00 1.182e-01 15.041 <2e-16 ***
## parentBMI -9.284e-05 2.952e-03 -0.031 0.975
## month4BMI -5.130e-02 4.519e-03 -11.351 <2e-16 ***
## A2MR -3.902e+00 1.690e-01 -23.087 <2e-16 ***
## race 9.858e-03 2.732e-02 0.361 0.719
## A2MR:race -1.601e-02 3.928e-02 -0.407 0.684
## parentBMI:A2MR 6.402e-03 4.377e-03 1.463 0.145
## month4BMI:A2MR 1.041e-01 5.984e-03 17.396 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1373 on 202 degrees of freedom
## Multiple R-squared: 0.7333, Adjusted R-squared: 0.724
## F-statistic: 79.33 on 7 and 202 DF, p-value: < 2.2e-16
The next task is to calculate the Q-function values. To do this, we predict each patient's outcome under each treatment option, then assign the treatment with the highest predicted outcome as that patient's optimal treatment.
# construct new data for prediction
new_data1 = bmiData
new_data2 = bmiData
new_data1$A2 = rep('CD', dim(bmiData)[1])
new_data2$A2 = rep('MR', dim(bmiData)[1])
# predict the outcome of each patient under each treatment
# using the model fitted previously
qvalue2 = matrix(NA, nrow = dim(bmiData)[1], ncol=2)
qvalue2[,1] = predict(mod_sec, new_data1)
qvalue2[,2] = predict(mod_sec, new_data2)
qvalue2 = data.frame(qvalue2)
# the fitted values
colnames(qvalue2) = c('CD','MR')
head(qvalue2)
# optimal treatment for each patient
tmp = apply(as.matrix(qvalue2), 1, which.max)
optimalTx2 = as.factor(ifelse(tmp == 1, 'CD', ifelse(tmp == 2, 'MR', tmp)))
# the first 6 optimal treatments
optimalTx2[1:6]
## [1] CD MR CD CD CD CD
## Levels: CD MR
We then fit the model at the first stage. Here, we have the covariates parentBMI, baselineBMI, race, and the treatment assignment A1. Recall our iterative formula

\[ \theta_t = \arg\min_{\theta} \, \mathbb{E}_n \Big[ \underbrace{R_t + \max_{a_{t+1}} Q_{t+1}(\bar{S}_{t+1}, \bar{A}_t, A_{t+1} = a_{t+1}; \theta_{t+1})}_{\text{pseudo outcome at Stage } t} - Q_t(\bar{S}_t, \bar{A}_t; \theta) \Big]^2 \]

We first compute the best predicted reward from the second-stage model. Then we add the observed reward from the first stage and treat the sum as the response to fit a model using the covariates parentBMI, baselineBMI, race, and A1.
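For the two-stage analysis, this general recursion reduces to a single pseudo outcome at the first stage:

\[ \widetilde{R}_1 = R_1 + \max_{a_2} Q_2(\bar{S}_2, A_1, A_2 = a_2; \widehat{\theta}_2), \]

the observed first-stage reward plus the best predicted second-stage outcome; this is exactly what the pseudoR computation below carries out.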
# Calculate the estimated reward based on model from the second stage.
pseudoR = apply(as.matrix(qvalue2),1,max) + R1
# fit the Q model that predicts the pseudo reward at the first stage
mod_fir = lm(pseudoR ~ parentBMI + baselineBMI + A1 + A1:(race + parentBMI + baselineBMI),
data=bmiData)
# summary of the model
summary(mod_fir)
##
## Call:
## lm(formula = pseudoR ~ parentBMI + baselineBMI + A1 + A1:(race +
## parentBMI + baselineBMI), data = bmiData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.4362 -3.9274 0.4742 4.0112 13.7624
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.80152 5.05898 -0.356 0.722135
## parentBMI -0.06477 0.10837 -0.598 0.550731
## baselineBMI 0.32096 0.14482 2.216 0.027784 *
## A1MR 9.96829 7.74214 1.288 0.199381
## A1CD:race -0.41011 1.06335 -0.386 0.700140
## A1MR:race -0.62478 1.07367 -0.582 0.561274
## parentBMI:A1MR -1.25129 0.14316 -8.740 8.99e-16 ***
## baselineBMI:A1MR 0.78765 0.21580 3.650 0.000334 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.371 on 202 degrees of freedom
## Multiple R-squared: 0.5311, Adjusted R-squared: 0.5148
## F-statistic: 32.68 on 7 and 202 DF, p-value: < 2.2e-16
# q-value for each patient
new_data1 = bmiData
new_data2 = bmiData
new_data1$A1 = rep('CD',dim(bmiData)[1])
new_data2$A1 = rep('MR',dim(bmiData)[1])
# predict the outcome of each patient under each treatment
# using the model fitted previously
qvalue1 = matrix(NA, nrow = dim(bmiData)[1], ncol=2)
qvalue1[,1] = predict(mod_fir, new_data1)
qvalue1[,2] = predict(mod_fir, new_data2)
qvalue1 = data.frame(qvalue1)
# the fitted values
colnames(qvalue1)=c('CD','MR')
head(qvalue1)
# optimal treatment for each patient at the first stage
tmp = apply(as.matrix(qvalue1), 1, which.max)
optimalTx1 = as.factor(ifelse(tmp == 1, 'CD', ifelse(tmp == 2, 'MR', tmp)))
optimalTx1[1:6]
## [1] CD MR MR MR MR MR
## Levels: CD MR
# the estimated optimal reward for each patient
Rbest = apply(qvalue1, 1, max)
# comparing the estimated optimal reward with the observed reward
mean(Rbest)
## [1] 9.45116
mean(R)
## [1] 6.455876
The DynTxRegime package

The above backward induction can also be implemented via the R package DynTxRegime; please see its reference manual for more details.
In this code, we consider a slightly different setting, in which all the immediate rewards \(R_t\) are zero and we simply observe the final reward at the last stage.
\[ \text{Reward} = -100*\frac{\text{month12BMI}-\text{baselineBMI}}{\text{baselineBMI}}. \]
This is also a very common setting for Q-learning when we do not have, or do not know how to define, immediate rewards at each stage. However, doing so does not change the estimation procedure. We first specify the second-stage models using functions from the DynTxRegime package.
- moMain specifies the main-effect covariates (those that do not interact with the treatment).
- moCont specifies the covariates that interact with the treatment at the current stage.

We then use the qLearn function to fit the model at the final stage; qLearn performs a single step of the Q-learning algorithm. moMain and moCont together determine the covariates of the model, response is the final reward, and txName is the name of the treatment variable assigned at this stage. Note that because the response here is the cumulative reward rather than the stage-specific reward \(R_2\), the fitted coefficients and recommended treatments can differ from our earlier second-stage model.
library(DynTxRegime)
# define the aggregated reward over two stages
y12 <- -100*(bmiData[,6L] - bmiData[,4L])/bmiData[,4L]
# Specify models at the second stage
moMain <- buildModelObj(model = ~ parentBMI + month4BMI,
solver.method = 'lm') # specify main effect
moCont <- buildModelObj(model = ~ race + parentBMI+month4BMI,
solver.method = 'lm') # specify interaction with treatment
# Second-Stage Analysis
fitSS <- qLearn(moMain = moMain, moCont = moCont,
data = bmiData,
response = y12,
txName = 'A2')
## First step of the Q-Learning Algorithm.
##
## Outcome regression.
## Combined outcome regression model: ~ parentBMI+month4BMI + A2 + A2:(race+parentBMI+month4BMI) .
## Regression analysis for Combined:
##
## Call:
## lm(formula = YinternalY ~ parentBMI + month4BMI + A2 + A2:race +
## parentBMI:A2 + month4BMI:A2, data = data)
##
## Coefficients:
## (Intercept) parentBMI month4BMI A2MR A2CD:race
## 48.04514 -0.35715 -0.83827 -14.54904 -0.25018
## A2MR:race parentBMI:A2MR month4BMI:A2MR
## 1.64329 0.39890 0.02297
##
##
## Recommended Treatments:
## CD MR
## 98 112
##
## Estimated value: 7.54403
show(fitSS)
## Q-Learning: step 1
## Outcome Regression Analysis
## Combined
##
## Call:
## lm(formula = YinternalY ~ parentBMI + month4BMI + A2 + A2:race +
## parentBMI:A2 + month4BMI:A2, data = data)
##
## Coefficients:
## (Intercept) parentBMI month4BMI A2MR A2CD:race
## 48.04514 -0.35715 -0.83827 -14.54904 -0.25018
## A2MR:race parentBMI:A2MR month4BMI:A2MR
## 1.64329 0.39890 0.02297
##
## Recommended Treatments:
## CD MR
## 98 112
##
## Estimated value: 7.54403
# you can extract the optimal treatment and decision function
optTx(fitSS)$optimalTx[1:6]
## [1] MR CD CD CD CD CD
## Levels: CD MR
optTx(fitSS)$decisionFunc[1:6,]
## CD MR
## [1,] 7.818515 8.553214
## [2,] 6.771519 5.095286
## [3,] 8.376248 6.696326
## [4,] 10.965522 8.130337
## [5,] 10.083604 8.743228
## [6,] 10.697010 8.576252
We then fit the model at the first stage. The only difference compared with the second stage is that we pass the fitted object from the previous stage, fitSS, as the response, and the treatment assigned at this stage is A1. We can check that the recommended first-stage treatments match those from our own implementation.
# First-stage Analysis
moMain <- buildModelObj(model = ~ parentBMI + baselineBMI,
solver.method = 'lm') # main effect
moCont <- buildModelObj(model = ~ race + parentBMI + baselineBMI,
solver.method = 'lm') # interaction effects
# fit the model at the first stage
fitFS <- qLearn(moMain = moMain, moCont = moCont,
data = bmiData,
response = fitSS,
txName = 'A1')
## Step 2 of the Q-Learning Algorithm.
##
## Outcome regression.
## Combined outcome regression model: ~ parentBMI+baselineBMI + A1 + A1:(race+parentBMI+baselineBMI) .
## Regression analysis for Combined:
##
## Call:
## lm(formula = YinternalY ~ parentBMI + baselineBMI + A1 + A1:race +
## parentBMI:A1 + baselineBMI:A1, data = data)
##
## Coefficients:
## (Intercept) parentBMI baselineBMI A1MR
## 36.73123 -0.08373 -0.68951 2.78926
## A1CD:race A1MR:race parentBMI:A1MR baselineBMI:A1MR
## 0.59410 0.65941 -0.36938 0.23110
##
##
## Recommended Treatments:
## CD MR
## 121 89
##
## Estimated value: 8.374786
optTx(fitFS)$optimalTx[1:6]
## [1] CD MR MR MR MR MR
## Levels: CD MR
optTx(fitFS)$decisionFunc[1:6,]
## CD MR
## [1,] 9.967447 9.433248
## [2,] 8.482970 8.746729
## [3,] 8.794969 8.913004
## [4,] 9.119389 10.236224
## [5,] 11.088749 12.234857
## [6,] 8.981291 9.422195
summary(fitFS)
## $outcome
## $outcome$Combined
##
## Call:
## lm(formula = YinternalY ~ parentBMI + baselineBMI + A1 + A1:race +
## parentBMI:A1 + baselineBMI:A1, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.1295 -1.1803 -0.0718 1.3649 4.1127
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 36.73123 1.74405 21.061 < 2e-16 ***
## parentBMI -0.08373 0.03736 -2.241 0.02609 *
## baselineBMI -0.68951 0.04992 -13.811 < 2e-16 ***
## A1MR 2.78926 2.66905 1.045 0.29725
## A1CD:race 0.59410 0.36658 1.621 0.10666
## A1MR:race 0.65941 0.37014 1.782 0.07633 .
## parentBMI:A1MR -0.36938 0.04935 -7.484 2.17e-12 ***
## baselineBMI:A1MR 0.23110 0.07440 3.106 0.00217 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.851 on 202 degrees of freedom
## Multiple R-squared: 0.768, Adjusted R-squared: 0.7599
## F-statistic: 95.51 on 7 and 202 DF, p-value: < 2.2e-16
##
##
##
## $optTx
## CD MR
## 121 89
##
## $value
## [1] 8.374786