Please remove this section when submitting your homework.
Students are encouraged to work together on homework and/or utilize advanced AI tools. However, sharing, copying, or providing any part of a homework solution or code to others is an infraction of the University’s rules on Academic Integrity. Any violation will be punished as severely as possible. Final submissions must be uploaded to Gradescope. No email or hard copy will be accepted. For late submission policy and grading rubrics, please refer to the course website.
Your file should be named HWx_yourNetID.pdf. For example, HW01_rqzhu.pdf. Please note that this must be a .pdf file; the .html format cannot be accepted. Make all of your R code chunks visible for grading. If you use the .Rmd file as a template, be sure to remove this instruction section. Make sure your R version is \(\geq 4.0.0\). This will ensure your random seed generation is the same as everyone else's. Please note that updating the R version may require you to re-install all of your packages.

In this question, our goal is to simulate a set of observations, and we will use them later to estimate the optimal treatment regime using batch Q-learning. The data come from a two-stage observational study, and we will generate the data based on the following mechanism:
You should generate 1000 independent samples and use them to estimate the policy value of the “behavior policy”.
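As a rough illustration, the behavior-policy value is simply the average total observed reward \(R_1 + R_2\) over the simulated samples. The sketch below is a minimal template: the data-generating step shown here is only a toy placeholder (the covariate, treatment, and reward models are assumptions for illustration) and should be replaced by the mechanism specified above.

```r
set.seed(1)
n <- 1000

## Toy placeholder mechanism -- NOT the one specified in this assignment;
## substitute the covariate, treatment, and reward models given above.
X1 <- rnorm(n)
A1 <- rbinom(n, 1, 0.5)
R1 <- 1 + 0.5 * X1 + A1 * X1 + rnorm(n)
X2 <- 0.5 * X1 + rnorm(n)
A2 <- rbinom(n, 1, 0.5)
R2 <- 1 + 0.5 * X2 + A2 * X2 + rnorm(n)
dat <- data.frame(X1, A1, R1, X2, A2, R2)

## Estimated value of the behavior policy: average total observed reward
value_behavior <- mean(dat$R1 + dat$R2)
value_behavior
```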
Q-Learning with DynTxRegime
Using the data generated in Question 1, estimate the optimal treatment regime via two-stage Q-learning using the DynTxRegime package. Note that in this package, you only need to include \(R_1 + R_2\) as the response variable in the second-stage model (pretend that R1 is \(0\) and R2 is \(R_1 + R_2\)). Use solver.method = 'lm' for both the main-effects and contrast models at each stage.

After estimating the optimal treatment regime, you need to generate new data based on the estimated optimal treatment regime. You can use the optTx() function with the fitted model and the generated covariates, and then extract the optimalTx from the output as your new treatment assignments. What is the estimated optimal policy value? Is the optimal policy better than the observed policy? Plot the histogram of the policy value from both policies and compare them. You can also compare the estimated densities of the two policy values using the density() function.
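For reference, here is a minimal sketch of the two-stage Q-learning workflow in DynTxRegime, fitting the second stage first and then passing that fit as the response of the first stage. The covariate and treatment names (X1, X2, A1, A2), the model formulas, and the data frames dat and newdat are illustrative assumptions; adapt them to your simulated data.

```r
library(DynTxRegime)
library(modelObj)

## Second stage: response is the total reward R1 + R2
moMain2 <- buildModelObj(model = ~ X1 + A1 + X2, solver.method = 'lm')
moCont2 <- buildModelObj(model = ~ X2,           solver.method = 'lm')
fitSS <- qLearn(moMain = moMain2, moCont = moCont2,
                data = dat, response = dat$R1 + dat$R2, txName = 'A2')

## First stage: the fitted second-stage object is supplied as the response
moMain1 <- buildModelObj(model = ~ X1, solver.method = 'lm')
moCont1 <- buildModelObj(model = ~ X1, solver.method = 'lm')
fitFS <- qLearn(moMain = moMain1, moCont = moCont1,
                data = dat, response = fitSS, txName = 'A1')

## Recommended treatments for newly generated covariates
optA1 <- optTx(fitFS, newdata = newdat)$optimalTx
optA2 <- optTx(fitSS, newdata = newdat)$optimalTx

## After regenerating rewards under these treatments (say newR1, newR2),
## hist() and density() can be used to compare the two policies, e.g.
## plot(density(newR1 + newR2)); lines(density(dat$R1 + dat$R2), lty = 2)
```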
BOWL with DynTxRegime
In this question, we will use the data generated in Question 1 to estimate the optimal treatment regime via a two-stage BOWL. You should use the bowl() function from the DynTxRegime package. For this question, you should use the rewards R1 and R2 as the response variables in the first and second stage, respectively. Use solver.method = 'glm' and family = 'binomial' for the propensity score model, and surrogate = 'hinge' and kernel = 'linear' for the BOWL model.

After fitting the model, you should generate a new set of 1000 observations based on the estimated optimal treatment regime. What is the estimated optimal policy value? Compare the performance of BOWL with that of the observed data.
In our lecture notes, we had an example of generating the MDP data. You can copy most of the code from the lecture notes, except that the policy and \(\gamma\) will be different.
\[ P_1 = \begin{pmatrix} 0.8 & 0.1 & 0.1 \\ 0.05 & 0.05 & 0.9 \\ 0.8 & 0.1 & 0.1 \end{pmatrix} \]
\[ P_2 = \begin{pmatrix} 0.5 & 0.25 & 0.25 \\ 0.1 & 0.8 & 0.1 \\ 0.2 & 0.2 & 0.6 \end{pmatrix} \]
Our immediate reward function is
\[ r = \begin{pmatrix} 5 & 3 \\ 1.6 & 3 \\ 4 & 2 \end{pmatrix} \]
Our first question is to estimate the policy value of a randomized trial policy:
\[ \pi = \begin{pmatrix} 0.6 & 0.4 \\ 0.5 & 0.5 \\ 0.4 & 0.6 \end{pmatrix} \]
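As a sketch of what the simulation could look like (following the general structure of the lecture-note example, not copied from it), the code below encodes \(P_1\), \(P_2\), \(r\), and \(\pi\), simulates trajectories under \(\pi\), and averages the discounted total reward. The number of trajectories, the horizon length, and the discount factor \(\gamma\) shown here are placeholders; use the values specified in the assignment.

```r
set.seed(1)

## Transition matrices for actions 1 and 2 (rows: current state, cols: next state)
P1 <- matrix(c(0.8,  0.1,  0.1,
               0.05, 0.05, 0.9,
               0.8,  0.1,  0.1), nrow = 3, byrow = TRUE)
P2 <- matrix(c(0.5,  0.25, 0.25,
               0.1,  0.8,  0.1,
               0.2,  0.2,  0.6),  nrow = 3, byrow = TRUE)

## Immediate reward r[s, a] and randomized trial policy pi_mat[s, a]
r      <- matrix(c(5, 3, 1.6, 3, 4, 2),       nrow = 3, byrow = TRUE)
pi_mat <- matrix(c(0.6, 0.4, 0.5, 0.5, 0.4, 0.6), nrow = 3, byrow = TRUE)

gamma <- 0.7   # placeholder; use the discount factor given in the assignment
nTraj <- 1000  # placeholder number of trajectories
Tlen  <- 100   # placeholder horizon length

## Simulate trajectories under pi_mat and average the discounted total reward
values <- replicate(nTraj, {
  s <- sample(1:3, 1)                          # random initial state
  total <- 0
  for (t in seq_len(Tlen)) {
    a <- sample(1:2, 1, prob = pi_mat[s, ])    # draw action from the policy
    total <- total + gamma^(t - 1) * r[s, a]   # accumulate discounted reward
    P <- if (a == 1) P1 else P2
    s <- sample(1:3, 1, prob = P[s, ])         # transition to the next state
  }
  total
})
mean(values)   # Monte Carlo estimate of the policy value
```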
Complete the following questions: