Instruction

Please remove this section when submitting your homework.

Students are encouraged to work together on homework and/or utilize advanced AI tools. However, sharing, copying, or providing any part of a homework solution or code to others is an infraction of the University’s rules on Academic Integrity. Any violation will be punished as severely as possible. Final submissions must be uploaded to Gradescope. No email or hard copy will be accepted. For late submission policy and grading rubrics, please refer to the course website.

Question 1: Data Generator and Monte-Carlo Simulation

In this question, our goal is to simulate a set of observations that we will later use to estimate the optimal treatment regime via batch Q-learning. The data come from a two-stage observational study and will be generated according to the following mechanism:

You should generate 1000 independent samples and use them to estimate the policy value of the “behavior policy”.
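As a minimal sketch of the Monte-Carlo step, suppose the generating mechanism above is implemented in a hypothetical function gen_data() that returns a data frame with stage-wise rewards R1 and R2; the behavior policy value is then estimated by the average total reward:

```r
set.seed(1)

# gen_data() is a hypothetical placeholder for the generating mechanism
# described above; it should return a data frame with covariates, treatments
# A1 and A2, and stage-wise rewards R1 and R2 for n independent subjects.
dat <- gen_data(n = 1000)

# Monte-Carlo estimate of the behavior policy value: the mean total reward
mean(dat$R1 + dat$R2)
```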

Question 2: Batch Q-learning with DynTxRegime

Using the data generated in Question 1, estimate the optimal treatment regime via two-stage Q-learning with the DynTxRegime package. Note that in this package you only need to include \(R_1 + R_2\) as the response variable in the second-stage model (that is, treat \(R_1\) as \(0\) and \(R_2\) as \(R_1 + R_2\)).
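A minimal sketch of the two-stage fit follows. The model formulas and the variable names (X1, X2, A1, A2) are assumptions; substitute the covariates and treatments produced by your Question 1 generator. Note that qLearn() fits the stages backwards: the second stage is fit first, and the resulting object is passed as the response of the first-stage call.

```r
library(DynTxRegime)
library(modelObj)

# Second-stage Q-function; main-effects and contrast model formulas are
# assumptions and should reflect your generating mechanism.
moMain2 <- buildModelObj(model = ~ X1 + X2,
                         solver.method = 'lm',
                         predict.method = 'predict.lm')
moCont2 <- buildModelObj(model = ~ X2,
                         solver.method = 'lm',
                         predict.method = 'predict.lm')

# Second stage is fit first; the response is the total reward R1 + R2
fitSS <- qLearn(moMain = moMain2, moCont = moCont2, data = dat,
                response = dat$R1 + dat$R2, txName = 'A2')

# First-stage Q-function; the second-stage fit is passed as the response
moMain1 <- buildModelObj(model = ~ X1,
                         solver.method = 'lm',
                         predict.method = 'predict.lm')
moCont1 <- buildModelObj(model = ~ X1,
                         solver.method = 'lm',
                         predict.method = 'predict.lm')
fitFS <- qLearn(moMain = moMain1, moCont = moCont1, data = dat,
                response = fitSS, txName = 'A1')
```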

After estimating the optimal treatment regime, generate new data based on the estimated optimal treatment regime. You can call the optTx() function with the fitted model and the generated covariates, then extract the optimalTx element from the output as your new treatment assignments. What is the estimated optimal policy value? Is the optimal policy better than the observed policy? Plot histograms of the policy values under both policies and compare them. You can also compare the estimated densities of the two policy values using the density() function.
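A sketch of this evaluation step is below; the generator details and variable names are assumptions carried over from Question 1, and the covariate distribution shown is a placeholder for your actual mechanism.

```r
# New baseline covariates (placeholder distribution -- use your Q1 mechanism)
newdat <- data.frame(X1 = rnorm(1000))
newdat$A1 <- optTx(fitFS, newdata = newdat)$optimalTx

# ... generate X2 and R1 from the mechanism given (X1, A1), then:
newdat$A2 <- optTx(fitSS, newdata = newdat)$optimalTx
# ... generate R2 from the mechanism given the full history

# Estimated optimal policy value
mean(newdat$R1 + newdat$R2)

# Compare the distribution of total rewards under the two policies
hist(dat$R1 + dat$R2, col = rgb(0, 0, 1, 0.4),
     main = 'Total reward by policy', xlab = 'R1 + R2')
hist(newdat$R1 + newdat$R2, col = rgb(1, 0, 0, 0.4), add = TRUE)

plot(density(dat$R1 + dat$R2), main = 'Estimated densities')
lines(density(newdat$R1 + newdat$R2), lty = 2)
```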

Question 3: BOWL with DynTxRegime

In this question, we will use the data generated in Question 1 to estimate the optimal treatment regime via two-stage BOWL, using the bowl() function from the DynTxRegime package. For this question, use the rewards \(R_1\) and \(R_2\) as the response variables in the first and second stages, respectively.
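A minimal sketch of the two-stage fit, again performed backwards, is below. The propensity model formulas and variable names are assumptions; also note that bowl() works with a binary treatment (commonly coded \(-1/+1\)), so recode your treatments if necessary.

```r
library(DynTxRegime)
library(modelObj)

# Stage-specific propensity models (formulas are assumptions); BOWL is an
# inverse-probability-weighted method, so the observational treatment
# mechanism must be modeled at each stage.
moPropen2 <- buildModelObj(model = ~ X2,
                           solver.method = 'glm',
                           solver.args = list(family = 'binomial'),
                           predict.method = 'predict.glm',
                           predict.args = list(type = 'response'))

# BOWL is fit backwards: second stage first, using reward R2
fitBowl2 <- bowl(moPropen = moPropen2, data = dat, reward = dat$R2,
                 txName = 'A2', regime = ~ X2)

moPropen1 <- buildModelObj(model = ~ X1,
                           solver.method = 'glm',
                           solver.args = list(family = 'binomial'),
                           predict.method = 'predict.glm',
                           predict.args = list(type = 'response'))

# First stage uses reward R1 and conditions on the second-stage fit
fitBowl1 <- bowl(moPropen = moPropen1, data = dat, reward = dat$R1,
                 txName = 'A1', regime = ~ X1, BOWLObj = fitBowl2)
```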

After fitting the model, generate a new set of 1000 observations based on the estimated optimal treatment regime. What is the estimated optimal policy value? Compare the performance of BOWL with that of the observed policy.
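The evaluation step mirrors Question 2; a brief sketch, with the object name newdat.bowl hypothetical:

```r
# Regenerate 1000 subjects following the BOWL regime, using
# optTx(fitBowl1, newdata)$optimalTx at stage 1 and
# optTx(fitBowl2, newdata)$optimalTx at stage 2, then compare
# mean total rewards under the two policies:
c(bowl = mean(newdat.bowl$R1 + newdat.bowl$R2),
  observed = mean(dat$R1 + dat$R2))
```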

Question 4: MDP and Policy Evaluation

In our lecture notes we had an example of generating MDP data. You can copy most of the code from the lecture notes, except that the policy and \(\gamma\) will be different. The transition matrices under the two actions are

\[ P_1 = \begin{pmatrix} 0.8 & 0.1 & 0.1 \\ 0.05 & 0.05 & 0.9 \\ 0.8 & 0.1 & 0.1 \end{pmatrix} \]

\[ P_2 = \begin{pmatrix} 0.5 & 0.25 & 0.25 \\ 0.1 & 0.8 & 0.1 \\ 0.2 & 0.2 & 0.6 \end{pmatrix} \]

Our immediate reward function, with rows indexing the three states and columns indexing the two actions, is

\[ r = \begin{pmatrix} 5 & 3 \\ 1.6 & 3 \\ 4 & 2 \end{pmatrix} \]

Our first question is to estimate the policy value of a randomized trial policy:

\[ \pi = \begin{pmatrix} 0.6 & 0.4 \\ 0.5 & 0.5 \\ 0.4 & 0.6 \end{pmatrix} \]
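A minimal sketch of the exact policy evaluation is below: the value of \(\pi\) solves the Bellman equation \(V^\pi = r^\pi + \gamma P^\pi V^\pi\), where \(P^\pi\) and \(r^\pi\) average the action-specific transitions and rewards with the policy weights. The discount factor is not restated here, so the \(\gamma\) in the code is a placeholder.

```r
# Transition matrices, rewards, and the randomized trial policy from above
P1 <- matrix(c(0.80, 0.10, 0.10,
               0.05, 0.05, 0.90,
               0.80, 0.10, 0.10), nrow = 3, byrow = TRUE)
P2 <- matrix(c(0.50, 0.25, 0.25,
               0.10, 0.80, 0.10,
               0.20, 0.20, 0.60), nrow = 3, byrow = TRUE)
r  <- matrix(c(5.0, 3.0,
               1.6, 3.0,
               4.0, 2.0), nrow = 3, byrow = TRUE)
pol <- matrix(c(0.6, 0.4,
                0.5, 0.5,
                0.4, 0.6), nrow = 3, byrow = TRUE)

gamma <- 0.9  # placeholder -- use the discount factor given in class

# Policy-averaged transition matrix and expected immediate reward
P.pi <- pol[, 1] * P1 + pol[, 2] * P2
r.pi <- rowSums(pol * r)

# Solve the Bellman equation (I - gamma * P.pi) V = r.pi
V <- solve(diag(3) - gamma * P.pi, r.pi)
V
```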

Complete the following questions: