Instruction
Students are encouraged to work together on homework. However,
sharing, copying, or providing any part of a homework solution
or code is an infraction of the University’s
rules on Academic Integrity. Any violation will be punished as
severely as possible. Final submissions must be uploaded to Gradescope. No
email or hard copy will be accepted. For late
submission policy and grading rubrics, please refer to the
course website.
- You are required to submit the rendered file
HWx_yourNetID.pdf. For example,
HW01_rqzhu.pdf. Please note that this must be a
.pdf file. The .html format
cannot be accepted since it may not be readable in
Gradescope. All proofs must be typed in LaTeX format. Make all of your
R code chunks visible for grading.
- Include your Name and NetID in the report.
- If you use this file or the example homework
.Rmd file
as a template, be sure to remove this instruction
section.
- On random seed and reproducibility: You should use
R version \(\geq 4.0.0\).
This ensures that your random number generation is the same as everyone
else's. Please note that updating the R version may require
you to reinstall all of your packages. In the markdown file, set
the seed properly so that the results can be reproduced (see the short
example after this list).
- For some questions, there will be restrictions on what
packages/functions you can use. Please read the requirements carefully.
As long as the question does not specify such restrictions, you can use
anything.
- Using AI tools: you should provide a brief
statement on the involvement of AI tools in your homework.
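Referring back to the random-seed bullet above, a minimal reproducibility chunk near the top of your .Rmd might look like the following (the seed value is only a placeholder):

```r
## Set a fixed seed once, near the top of the .Rmd, so results are reproducible
set.seed(546)      # placeholder value; any fixed integer works
R.version.string   # record the R version used (should be >= 4.0.0)
```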
Question 1 [40 pts]: Creating Your Own Tabular Dynamics
The following figure represents a simple tabular environment called
Frozen Lake (see the detailed explanation here).
This is an environment with 25 states (a \(5 \times 5\) grid) and 4 actions (up,
down, left, right). Mario starts at the top-left cell and hopes to
rescue Princess Toadstool, who is located at the bottom-right cell.
However, there are several holes and turtles on this lake that could
cause trouble. Our goal is to find the optimal policy for Mario to reach
the goal safely and quickly.
Here are some details for you to consider:
- You should start by defining the tabular dynamics for this
environment. For example, consider using a \(25 \times 1\) vector to represent the state
space, where each entry corresponds to a cell in the grid (from top-left
to bottom-right). You can also do this as a \(5 \times 5\) matrix if you prefer.
- For each state, if you take an action (up, down, left, right), you
will deterministically move to another state (or stay at the same state
if you attempt to move outside the grid; see the details
below).
- You will need to define a reward function for each (state, action)
pair:
- If Mario steps into an icy hole, the reward is -0.5. In the original
game, this will terminate the game, but for this exercise, we will allow
Mario to climb out of the hole and continue his journey.
- If Mario steps onto a cell with a turtle, he has to spend some time fighting it.
The reward is -1, but he will be able to continue his journey.
- If Mario attempts to step outside the grid, the
reward is -10. But he will stay at his original location (hence not
moving), and the game continues.
- When Mario moves to an empty cell, the reward is -0.1 due to
spending energy to walk, and the game continues.
- If Mario reaches Princess Toadstool, the reward is 10.
For such a deterministic tabular MDP, you do not need to simulate
data from the environment. Instead, you can directly implement value
iteration to find the optimal policy.
a). Please clearly define your state space, action space, transition
dynamics, and reward function. Implement the value iteration algorithm
to find the optimal value function, and clearly explain the logic of your code. Use a
discount factor of 0.9. Present the optimal value as a \(5 \times 5\) matrix, which is easier to
visualize.
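As a starting point, here is one possible sketch of how the dynamics could be encoded and value iteration run; it is not the required implementation, and the hole/turtle cell indices below are placeholders that must be replaced with the positions shown in the figure.

```r
n_side  <- 5
n_state <- n_side^2                       # states 1..25, top-left to bottom-right
actions <- c("up", "down", "left", "right")

hole_cells   <- c(7, 13)                  # placeholder indices; read off the figure
turtle_cells <- c(9, 18)                  # placeholder indices; read off the figure
goal_cell    <- n_state                   # bottom-right cell (Princess Toadstool)

to_rc <- function(s) c((s - 1) %/% n_side + 1, (s - 1) %% n_side + 1)
to_s  <- function(r, c) (r - 1) * n_side + c

## Reward for landing in s_next; `moved` is FALSE when Mario bumps into the wall
step_reward <- function(s_next, moved) {
  if (!moved)                   return(-10)
  if (s_next == goal_cell)      return(10)
  if (s_next %in% hole_cells)   return(-0.5)
  if (s_next %in% turtle_cells) return(-1)
  -0.1                                     # ordinary cell: energy cost of walking
}

## Deterministic transition and reward tables, both 25 x 4
trans  <- matrix(NA_real_, n_state, length(actions))
reward <- matrix(NA_real_, n_state, length(actions))
for (s in 1:n_state) {
  rc <- to_rc(s)
  for (a in seq_along(actions)) {
    d  <- switch(actions[a], up = c(-1, 0), down = c(1, 0),
                 left = c(0, -1), right = c(0, 1))
    r2 <- rc[1] + d[1]; c2 <- rc[2] + d[2]
    moved <- r2 >= 1 && r2 <= n_side && c2 >= 1 && c2 <= n_side
    s2 <- if (moved) to_s(r2, c2) else s
    trans[s, a]  <- s2
    reward[s, a] <- step_reward(s2, moved)
  }
}

## Value iteration with discount factor 0.9
gamma <- 0.9
V <- rep(0, n_state)
repeat {
  Q     <- reward + gamma * matrix(V[trans], n_state, length(actions))
  V_new <- apply(Q, 1, max)
  if (max(abs(V_new - V)) < 1e-8) { V <- V_new; break }
  V <- V_new
}
matrix(V, n_side, n_side, byrow = TRUE)    # optimal value displayed as 5 x 5
```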
b). After obtaining the optimal value, derive the optimal policy by
choosing the action that maximizes the one-step Bellman update \(r(s, a) + \gamma V^*(s')\) at each state.
Provide a visualization of the optimal policy on a \(5 \times 5\) matrix (e.g.,
using arrows or letters to indicate the optimal action at each
state).
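Continuing the sketch above (reusing its `trans`, `reward`, `gamma`, `actions`, and converged `V`), the greedy policy can be read off with an argmax over actions:

```r
## Greedy policy: argmax over actions of the one-step Bellman update
Q_star  <- reward + gamma * matrix(V[trans], n_state, length(actions))
pi_star <- actions[max.col(Q_star, ties.method = "first")]
matrix(substr(pi_star, 1, 1), n_side, n_side, byrow = TRUE)  # u/d/l/r letters
```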
Question 2 [40 pts]: Fitted Q-Iteration for Continuous Actions
In our lecture, we introduced the Fitted Q-Iteration (FQI) algorithm
for reinforcement learning with discrete action spaces (two treatment
options) and demonstrated that using the Cart-Pole example. In practice,
the problem can be extended to settings with continuous action space for
all force level between \([-1, 1]\).
Read our CartPole
simulator documentation carefully and perform the following
tasks:
a). First, we need to create a behavior policy that can generate data
with continuous action space. Please implement a behavior policy that
samples the action from a normal distribution with mean 0 and standard
deviation 0.5. If the generated action is outside the range of \([-1, 1]\), truncate it to be -1 or 1
accordingly. Provide some summary plots to show that your behavior
policy is working as expected.
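A minimal sketch of such a behavior policy is given below; it ignores the state entirely and simply clips a \(N(0, 0.5^2)\) draw to \([-1, 1]\). The function and object names are placeholders, not part of the simulator's interface.

```r
## Behavior policy: draw from N(0, 0.5^2), then clip to the action range [-1, 1]
behavior_policy <- function(state) {
  a <- rnorm(1, mean = 0, sd = 0.5)
  min(max(a, -1), 1)              # values below -1 or above 1 are truncated
}

## Quick summaries of the sampled actions as a sanity check
set.seed(1)
acts <- replicate(10000, behavior_policy(state = NULL))
summary(acts)
mean(abs(acts) == 1)              # fraction of truncated actions (about 4.6% expected)
hist(acts, breaks = 50, main = "Sampled behavior actions", xlab = "action")
```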
b). Next, we will use a fitted Q-iteration algorithm to learn an
optimal policy from the data generated by the behavior policy. Note that
the formula we showed in class only compares the Q-function values at
actions -1 and 1 and picks the maximum. You need to adapt
this formulation to the continuous action space. Several options are
possible. One simple way is to discretize the action space into several
bins and compare the Q-function values at those bins, although this can be
unstable for certain choices of model. Another way is to derive the
analytical form of the optimal action that maximizes the Q-function
given a state, if possible. This depends on the regression method you
choose to fit the Q-function (e.g., it is easier for a quadratic function
approximation, while for a model that is linear in the action the maximizer
simply falls on the boundary, which is not very meaningful). A minimal
sketch of the discretization approach appears after the list below. Do the
following:
- Clearly define and explain your fitted Q-iteration approach
- Implement the method with a discount factor of 0.9
- Show some convergence diagnostics (you may also define your
convergence criteria in the code) to demonstrate that your optimization
algorithm is working properly
- Provide evidence that your learned policy performs well when actually
deployed on the simulator
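For concreteness, here is one possible sketch of the discretized-action FQI update. It assumes the collected batch sits in a data frame `dat` with state columns `s1`–`s4`, an `action` column, a `reward` column, and next-state columns `s1_next`–`s4_next`; these names, the quadratic-in-action linear model, and the convergence tolerance are all placeholder choices, not requirements.

```r
gamma   <- 0.9
a_grid  <- seq(-1, 1, by = 0.1)               # candidate actions for the max step
s_cols  <- paste0("s", 1:4)                   # placeholder state column names
sp_cols <- paste0(s_cols, "_next")            # placeholder next-state column names

q_target <- dat$reward                        # iteration-0 target: immediate reward
fit_old  <- NULL
for (iter in 1:200) {
  ## Regression step: fit Q(s, a) to the current targets
  train <- data.frame(dat[s_cols], action = dat$action, y = q_target)
  fit   <- lm(y ~ (s1 + s2 + s3 + s4) * (action + I(action^2)), data = train)

  ## Max step: evaluate the fitted Q at next states over the candidate grid
  next_states <- setNames(dat[sp_cols], s_cols)
  q_next <- sapply(a_grid, function(a)
    predict(fit, newdata = transform(next_states, action = a)))
  q_target <- dat$reward + gamma * apply(q_next, 1, max)

  ## Convergence diagnostic: change in fitted Q-values between iterations
  if (!is.null(fit_old)) {
    delta <- max(abs(predict(fit, train) - predict(fit_old, train)))
    cat("iter", iter, "max change in fitted Q:", delta, "\n")
    if (delta < 1e-3) break
  }
  fit_old <- fit
}

## Greedy (learned) policy: best candidate action for a given state
greedy_action <- function(state) {            # `state` = named vector with s1..s4
  qs <- sapply(a_grid, function(a)
    predict(fit, newdata = data.frame(as.list(state), action = a)))
  a_grid[which.max(qs)]
}
```

The same loop works with other regression methods by swapping out the `lm()` call; only the max step and the convergence check need to be kept.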
Question 3 [80 pts]: The Expedia Hotel Search Data Mini Project
We will continue to use the Expedia Hotel Search Data from Kaggle.
However, this time, the data is further processed in the following
way:
- We treat each destination as a unique “individual”, and the searches
for that destination form a trajectory of multiple observations made for that
individual over time.
- Once an online user searches for a destination, Expedia will show a
list of hotels for that destination. This leads to multiple entries
(page views) in the original data for that search. However, for our
purpose, we summarize the information displayed to the user
on that page and call it \((S_t,
A_t)\).
- The \(S_t\) here includes some
information about the hotel, the user, the search context, and competitors.
- The \(A_t\) includes information
about how the website ranks the hotel list. Keep in mind that
Expedia wants to come up with the best ranking strategy to maximize
their revenue. Some variables we summarized are: the correlation between
the current ranking and the prices of the hotels, the correlation between
the current ranking and the review scores of the hotels, etc. These
variables can be viewed as a summary of the ranking strategy \(A_t\) used by Expedia for that search.
- Once a user finishes this entire browsing session, we will observe some
outcome \(R_t\) that indicates whether
the user clicked or booked a hotel, and how much was spent during
that session.
- Next time another user searches for the same destination, Expedia
will again show a list of hotels, potentially using a similar/different
ranking strategy \(A_{t+1}\) based on
the previous search results. This process continues for multiple
searches for the same destination to form a trajectory of observations
for that destination.
Our TA Mehrdad has once again prepared a processed version of the
data for us. This new format of the data is contained in the sub-folder
called “Expedia Data (condensed MDP)” on his GitHub
repository. Besides the processed data, you will find a file in that
sub-folder that explains the data processing idea and summarizes the
data. Please read it carefully before you
proceed.
This is an open-ended question. You are free to explore the data and
apply any offline RL methods you have learned in this course to learn an
optimal policy for Expedia to rank their hotel list. However, you need
to follow these steps:
- Start with exploratory data analysis to understand the data
structure, distributions, etc., so that you can make informed decisions in
the following steps.
- Among all the variables available in the data, you need to pick
one reward variable out of these candidates:
total_clicks, total_gross_revenue,
gross_revenue_per_night to be your \(R_t\). Note that some of your following
analysis will depend on your choice of the reward variable, so please
explain your choice clearly. Once you pick your reward variable, the
other two candidates should be dropped from the data, since they cannot
be used as either state or action.
- Next, among all the variables available in the data, you need to
pick one variable to be your action variable \(A_t\). Here you have many choices, for
example,
corr_pos_price, corr_pos_review,
total_promotions, etc. Once you decide on your action
variable, you will keep the rest of the candidates as
state variables.
- Now you have a dataset with many state variables, one action
variable, and one reward variable. We will treat this as an MDP problem
and fit RL models (a minimal sketch of this data-preparation step is
given after this list).
- Starting from here, you can apply any offline RL methods you have
learned in this course to learn an optimal policy. You are not required
to validate it on testing data. However, you should demonstrate
these aspects of your model:
- You should clearly explain your approach, your reasoning, and your
choices at every step of the way. This includes all details such as
hyper-parameter tuning, model selection, convergence diagnostics,
etc.
- Among the high-dimensional state variables, you should perform some
variable selection or dimension reduction to identify the most important
state variables that affect the Q-function and the optimal policy.
- After obtaining the final model, you should provide some
interpretation of the results via tables and figures. For example, you
may visualize how the optimal action changes with respect to some state
variables, and explain what that means in practice. Note that the signals in
this dataset could be relatively weak, so you may need to be careful
in your analysis and interpretation.
- You should also discuss the limitations of your analysis and
potential improvements that can be made in the future.
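As a minimal illustration of the reward/action choice and the data preparation described in the list above, the sketch below assumes the processed data has been read into a data frame `expedia`; the chosen reward (`total_clicks`) and action (`corr_pos_price`) are placeholder choices that you should justify and may well replace.

```r
library(dplyr)

## Placeholder choices: reward = total_clicks, action = corr_pos_price.
## The two unused reward candidates are dropped; everything that is neither
## the reward nor the action is kept as the state S_t.
mdp_dat <- expedia %>%
  select(-total_gross_revenue, -gross_revenue_per_night) %>%
  rename(R = total_clicks, A = corr_pos_price)
```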