Instructions

Students are encouraged to work together on homework. However, sharing, copying, or providing any part of a homework solution or code is an infraction of the University’s rules on Academic Integrity. Any violation will be punished as severely as possible. Final submissions must be uploaded to Gradescope. No email or hard-copy submission will be accepted. For the late-submission policy and grading rubrics, please refer to the course website.

Question 1 [40 pts]: Creating Your Own Tabular Dynamics

The following figure represents a simple tabular environment called Frozen Lake (see the detailed explanation here). This is an environment with 25 states (a 5x5 grid) and 4 actions (up, down, left, right). Mario starts at the top-left cell and hopes to rescue Princess Toadstool, who is located at the bottom-right cell. However, there are several holes and turtles on the lake that could cause trouble. Our goal is to find the optimal policy for Mario to reach the goal safely and quickly.

Here are some details for you to consider:

For such a deterministic tabular MDP, you do not need to simulate data from the environment. Instead, you can directly implement value iteration to find the optimal policy.

a). Clearly define your state space, action space, transition dynamics, and reward function. Implement the value iteration algorithm to find the optimal value function, and clearly explain the logic of your code. Use a discount factor of 0.9. Present the optimal value as a \(5 \times 5\) matrix, which is easier to visualize.
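Below is a minimal value-iteration sketch for a deterministic \(5 \times 5\) grid world. The hole locations, the reward convention (+1 for entering the goal, -1 for entering a hazard, 0 otherwise), and the convergence tolerance are illustrative assumptions; replace them with the layout from the figure and your own design choices.

```python
import numpy as np

N = 5                      # 5x5 grid
GAMMA = 0.9                # discount factor from the problem statement
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # up, down, left, right
GOAL = (4, 4)              # Princess Toadstool's cell (bottom right)
HOLES = {(1, 3), (3, 1)}   # hypothetical hazard cells; use the ones in the figure

def reward(s_next):
    """Reward received upon entering a cell (an assumed convention)."""
    if s_next == GOAL:
        return 1.0
    if s_next in HOLES:
        return -1.0
    return 0.0

def step(s, a):
    """Deterministic transition: move within the grid, otherwise stay in place."""
    if s == GOAL or s in HOLES:          # absorbing states
        return s
    r, c = s[0] + a[0], s[1] + a[1]
    return (r, c) if 0 <= r < N and 0 <= c < N else s

V = np.zeros((N, N))
while True:
    V_new = np.zeros_like(V)
    for r in range(N):
        for c in range(N):
            s = (r, c)
            if s == GOAL or s in HOLES:   # value of absorbing states stays 0
                continue
            # Bellman optimality backup: max_a [ r(s') + gamma * V(s') ]
            V_new[r, c] = max(reward(step(s, a)) + GAMMA * V[step(s, a)]
                              for a in ACTIONS)
    if np.max(np.abs(V_new - V)) < 1e-8:  # stop when the backup has converged
        V = V_new
        break
    V = V_new

print(np.round(V, 3))                     # optimal value as a 5x5 matrix
```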

b). After obtaining the optimal value, derive the optimal policy by choosing, at each state, the action that maximizes the right-hand side of the Bellman optimality equation (the one-step look-ahead). Provide a visualization of the optimal policy on a 5x5 grid (e.g., using arrows or letters to indicate the optimal action at each state).
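Continuing the sketch above (it reuses `V`, `N`, `GOAL`, `HOLES`, `ACTIONS`, `GAMMA`, `reward`, and `step` from the previous block), the greedy policy can be read off by applying the same one-step look-ahead at each state and taking the argmax over actions; arrows mark the chosen action, while 'G' and 'H' mark the assumed goal and hazard cells.

```python
ARROWS = ['↑', '↓', '←', '→']            # same ordering as ACTIONS above

policy = np.full((N, N), ' ', dtype=object)
for r in range(N):
    for c in range(N):
        s = (r, c)
        if s == GOAL or s in HOLES:
            policy[r, c] = 'G' if s == GOAL else 'H'
            continue
        # one-step look-ahead: Q(s, a) = r(s') + gamma * V(s')
        q = [reward(step(s, a)) + GAMMA * V[step(s, a)] for a in ACTIONS]
        policy[r, c] = ARROWS[int(np.argmax(q))]

print(policy)                             # optimal policy on the 5x5 grid
```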

Question 2 [40 pts]: Fitted Q-Iteration for Continuous Actions

In our lecture, we introduced the Fitted Q-Iteration (FQI) algorithm for reinforcement learning with discrete action spaces (two treatment options) and demonstrated it using the Cart-Pole example. In practice, the problem can be extended to settings with a continuous action space, allowing any force level in \([-1, 1]\). Read our CartPole simulator documentation carefully and perform the following tasks:

a). First, we need a behavior policy that generates data with a continuous action space. Implement a behavior policy that samples each action from a normal distribution with mean 0 and standard deviation 0.5; if the sampled action falls outside the range \([-1, 1]\), truncate it to -1 or 1 accordingly. Provide some summary plots to show that your behavior policy behaves as expected.
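A minimal sketch of such a behavior policy follows, assuming the simulator takes a scalar action per step; the clipping implements the truncation to \([-1, 1]\), and the histogram is one possible summary plot.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

def behavior_policy(state=None):
    """Sample a ~ N(0, 0.5^2) and truncate to [-1, 1].

    The state argument is unused (the policy is state-independent) but is kept
    so the function matches a generic policy(state) -> action interface.
    """
    a = rng.normal(loc=0.0, scale=0.5)
    return float(np.clip(a, -1.0, 1.0))

# Summary plot: the histogram should look Gaussian with small spikes at +/- 1
# from the truncated tails (roughly 2.3% of the mass on each side).
samples = [behavior_policy() for _ in range(10_000)]
plt.hist(samples, bins=50)
plt.xlabel('action')
plt.ylabel('count')
plt.title('Behavior policy: N(0, 0.5^2) truncated to [-1, 1]')
plt.show()
```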

b). Next, we will use the fitted Q-iteration algorithm to learn an optimal policy from the data generated by the behavior policy. Note that the formulation we showed in class only compares the Q-function at the two actions -1 and 1 and picks the maximum; you need to adapt it to a continuous action space. Several options are possible. One simple way is to discretize the action space into several bins and compare the Q-function values at those bins, although this can be very unstable for certain choices of model (a sketch of this approach is given below). Another way is to derive the analytical form of the action that maximizes the Q-function given a state, if possible. This may depend on the regression method you choose to fit the Q-function (e.g., it is easier for a quadratic function approximation, while for linear models it might be meaningless). Do the following:
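A hedged sketch of the discretization approach: it assumes flattened arrays `states`, `actions`, `rewards`, `next_states` extracted from trajectories generated by the behavior policy in part (a), uses a random-forest regressor for the Q-function, and ignores terminal-state handling. The action-grid size, number of iterations, and discount factor are illustrative choices, not prescribed values.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

GAMMA = 0.95                               # discount factor (assumed)
ACTION_GRID = np.linspace(-1.0, 1.0, 21)   # candidate actions for the max step

def fitted_q_iteration(states, actions, rewards, next_states, n_iter=50):
    """FQI with the max over a discretized action grid in the regression target."""
    X = np.hstack([states, actions.reshape(-1, 1)])   # regress Q on (state, action)
    targets = rewards.copy()                          # first iteration: Q ~ reward
    model = None
    for _ in range(n_iter):
        model = RandomForestRegressor(n_estimators=100)
        model.fit(X, targets)
        # Evaluate Q(s', a') on every grid action and take the max per sample.
        q_next = np.column_stack([
            model.predict(np.hstack([next_states,
                                     np.full((len(next_states), 1), a)]))
            for a in ACTION_GRID
        ])
        targets = rewards + GAMMA * q_next.max(axis=1)
    return model

def greedy_action(model, state):
    """Return the grid action that maximizes the fitted Q at a given state."""
    q_vals = model.predict(np.hstack([np.tile(state, (len(ACTION_GRID), 1)),
                                      ACTION_GRID.reshape(-1, 1)]))
    return ACTION_GRID[int(np.argmax(q_vals))]
```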

Question 3 [80 pts]: The Expedia Hotel Search Data Mini Project

We will continue to use the Expedia Hotel Search Data from Kaggle. However, this time the data has been further processed, as described below:

Our TA Mehrdad has once again prepared a processed version of the data for us. This new format of the data is contained in the sub-folder called “Expedia Data (condensed MDP)” on his GitHub repository. Besides the processed data, that sub-folder contains a file that explains the data-processing idea and summarizes the data. Please read it carefully before you proceed.

This is an open-ended question. You are free to explore the data and apply any offline RL method you have learned in this course to learn an optimal policy for Expedia to rank their hotel list. However, you need to follow these steps: