Instruction

Please remove this section when submitting your homework.

Students are encouraged to work together on homework and/or utilize advanced AI tools. However, sharing, copying, or providing any part of a homework solution or code to others is an infraction of the University’s rules on Academic Integrity. Any violation will be punished as severely as possible. Final submissions must be uploaded to Gradescope. No email or hard copy will be accepted. For late submission policy and grading rubrics, please refer to the course website.

Question 1: Data Generator and Monte-Carlo Simulation

In this question, our goal is to simulate a set of observations that we will later use to estimate the optimal treatment regime via batch Q-learning. The data come from a two-stage observational study and will be generated according to the following mechanism:

You should generate 1000 independent samples and use them to estimate the policy value of the “behavior policy”.
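As a minimal sketch of the Monte-Carlo step, suppose the generating mechanism above is implemented in a hypothetical function gen_data() that returns a data frame with stage-wise rewards R1 and R2; the behavior policy value is then estimated by the average total reward:

```r
set.seed(1)

# gen_data() is a hypothetical placeholder for the generating mechanism
# described above; it should return a data frame with covariates, treatments
# A1 and A2, and stage-wise rewards R1 and R2 for n independent subjects.
dat <- gen_data(n = 1000)

# Monte-Carlo estimate of the behavior policy value: the mean total reward
mean(dat$R1 + dat$R2)
```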

Question 2: Batch Q-learning with DynTxRegime

Using the data generated in Question 1, estimate the optimal treatment regime via two-stage Q-learning with the DynTxRegime package. Note that in this package you only need to include \(R_1 + R_2\) as the response variable in the second-stage model (that is, treat \(R_1\) as \(0\) and \(R_2\) as \(R_1 + R_2\)).
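A minimal sketch of the two-stage fit follows. The model formulas and the variable names (X1, X2, A1, A2) are assumptions; substitute the covariates and treatments produced by your Question 1 generator. Note that qLearn() fits the stages backwards: the second stage is fit first, and the resulting object is passed as the response of the first-stage call.

```r
library(DynTxRegime)
library(modelObj)

# Second-stage Q-function; main-effects and contrast model formulas are
# assumptions and should reflect your generating mechanism.
moMain2 <- buildModelObj(model = ~ X1 + X2,
                         solver.method = 'lm',
                         predict.method = 'predict.lm')
moCont2 <- buildModelObj(model = ~ X2,
                         solver.method = 'lm',
                         predict.method = 'predict.lm')

# Second stage is fit first; the response is the total reward R1 + R2
fitSS <- qLearn(moMain = moMain2, moCont = moCont2, data = dat,
                response = dat$R1 + dat$R2, txName = 'A2')

# First-stage Q-function; the second-stage fit is passed as the response
moMain1 <- buildModelObj(model = ~ X1,
                         solver.method = 'lm',
                         predict.method = 'predict.lm')
moCont1 <- buildModelObj(model = ~ X1,
                         solver.method = 'lm',
                         predict.method = 'predict.lm')
fitFS <- qLearn(moMain = moMain1, moCont = moCont1, data = dat,
                response = fitSS, txName = 'A1')
```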

After estimating the optimal treatment regime, generate new data based on the estimated optimal treatment regime. You can call the optTx() function with the fitted model and the generated covariates, then extract the optimalTx element from the output as your new treatment assignments. What is the estimated optimal policy value? Is the optimal policy better than the observed policy? Plot histograms of the policy values under both policies and compare them. You can also compare the estimated densities of the two policy values using the density() function.
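A sketch of this evaluation step is below; the generator details and variable names are assumptions carried over from Question 1, and the covariate distribution shown is a placeholder for your actual mechanism.

```r
# New baseline covariates (placeholder distribution -- use your Q1 mechanism)
newdat <- data.frame(X1 = rnorm(1000))
newdat$A1 <- optTx(fitFS, newdata = newdat)$optimalTx

# ... generate X2 and R1 from the mechanism given (X1, A1), then:
newdat$A2 <- optTx(fitSS, newdata = newdat)$optimalTx
# ... generate R2 from the mechanism given the full history

# Estimated optimal policy value
mean(newdat$R1 + newdat$R2)

# Compare the distribution of total rewards under the two policies
hist(dat$R1 + dat$R2, col = rgb(0, 0, 1, 0.4),
     main = 'Total reward by policy', xlab = 'R1 + R2')
hist(newdat$R1 + newdat$R2, col = rgb(1, 0, 0, 0.4), add = TRUE)

plot(density(dat$R1 + dat$R2), main = 'Estimated densities')
lines(density(newdat$R1 + newdat$R2), lty = 2)
```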

Question 3: BOWL with DynTxRegime

In this question, we will use the data generated in Question 1 to estimate the optimal treatment regime via two-stage BOWL, using the bowl() function from the DynTxRegime package. For this question, use the rewards \(R_1\) and \(R_2\) as the response variables in the first and second stages, respectively.
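A minimal sketch of the two-stage fit, again performed backwards, is below. The propensity model formulas and variable names are assumptions; also note that bowl() works with a binary treatment (commonly coded \(-1/+1\)), so recode your treatments if necessary.

```r
library(DynTxRegime)
library(modelObj)

# Stage-specific propensity models (formulas are assumptions); BOWL is an
# inverse-probability-weighted method, so the observational treatment
# mechanism must be modeled at each stage.
moPropen2 <- buildModelObj(model = ~ X2,
                           solver.method = 'glm',
                           solver.args = list(family = 'binomial'),
                           predict.method = 'predict.glm',
                           predict.args = list(type = 'response'))

# BOWL is fit backwards: second stage first, using reward R2
fitBowl2 <- bowl(moPropen = moPropen2, data = dat, reward = dat$R2,
                 txName = 'A2', regime = ~ X2)

moPropen1 <- buildModelObj(model = ~ X1,
                           solver.method = 'glm',
                           solver.args = list(family = 'binomial'),
                           predict.method = 'predict.glm',
                           predict.args = list(type = 'response'))

# First stage uses reward R1 and conditions on the second-stage fit
fitBowl1 <- bowl(moPropen = moPropen1, data = dat, reward = dat$R1,
                 txName = 'A1', regime = ~ X1, BOWLObj = fitBowl2)
```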

After fitting the model, generate a new set of 1000 observations based on the estimated optimal treatment regime. What is the estimated optimal policy value? Compare the performance of BOWL with that of the observed policy.
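The evaluation step mirrors Question 2; a brief sketch, with the object name newdat.bowl hypothetical:

```r
# Regenerate 1000 subjects following the BOWL regime, using
# optTx(fitBowl1, newdata)$optimalTx at stage 1 and
# optTx(fitBowl2, newdata)$optimalTx at stage 2, then compare
# mean total rewards under the two policies:
c(bowl = mean(newdat.bowl$R1 + newdat.bowl$R2),
  observed = mean(dat$R1 + dat$R2))
```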

Question 4: MDP and Policy Evaluation

In our lecture notes we had an example of generating MDP data. You can copy most of the code from the lecture notes, except that the policy and \(\gamma\) will be different. The transition matrices under the two actions are

\[ P_1 = \begin{pmatrix} 0.8 & 0.1 & 0.1 \\ 0.05 & 0.05 & 0.9 \\ 0.8 & 0.1 & 0.1 \end{pmatrix} \]

\[ P_2 = \begin{pmatrix} 0.5 & 0.25 & 0.25 \\ 0.1 & 0.8 & 0.1 \\ 0.2 & 0.2 & 0.6 \end{pmatrix} \]

Our immediate reward function, with rows indexing the three states and columns indexing the two actions, is

\[ r = \begin{pmatrix} 5 & 3 \\ 1.6 & 3 \\ 4 & 2 \end{pmatrix} \]

Our first question is to estimate the policy value of a randomized trial policy:

\[ \pi = \begin{pmatrix} 0.6 & 0.4 \\ 0.5 & 0.5 \\ 0.4 & 0.6 \end{pmatrix} \]
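A minimal sketch of the exact policy evaluation is below: the value of \(\pi\) solves the Bellman equation \(V^\pi = r^\pi + \gamma P^\pi V^\pi\), where \(P^\pi\) and \(r^\pi\) average the action-specific transitions and rewards with the policy weights. The discount factor is not restated here, so the \(\gamma\) in the code is a placeholder.

```r
# Transition matrices, rewards, and the randomized trial policy from above
P1 <- matrix(c(0.80, 0.10, 0.10,
               0.05, 0.05, 0.90,
               0.80, 0.10, 0.10), nrow = 3, byrow = TRUE)
P2 <- matrix(c(0.50, 0.25, 0.25,
               0.10, 0.80, 0.10,
               0.20, 0.20, 0.60), nrow = 3, byrow = TRUE)
r  <- matrix(c(5.0, 3.0,
               1.6, 3.0,
               4.0, 2.0), nrow = 3, byrow = TRUE)
pol <- matrix(c(0.6, 0.4,
                0.5, 0.5,
                0.4, 0.6), nrow = 3, byrow = TRUE)

gamma <- 0.9  # placeholder -- use the discount factor given in class

# Policy-averaged transition matrix and expected immediate reward
P.pi <- pol[, 1] * P1 + pol[, 2] * P2
r.pi <- rowSums(pol * r)

# Solve the Bellman equation (I - gamma * P.pi) V = r.pi
V <- solve(diag(3) - gamma * P.pi, r.pi)
V
```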

Complete the following questions: