Instruction
Students are encouraged to work together on homework. However,
sharing, copying, or providing any part of a homework solution
or code is an infraction of the University’s
rules on Academic Integrity. Any violation will be punished as
severely as possible. Final submissions must be uploaded to Gradescope. No
email or hard copy will be accepted. For late
submission policy and grading rubrics, please refer to the
course website.
- You are required to submit the rendered file
HWx_yourNetID.pdf. For example,
HW01_rqzhu.pdf. Please note that this must be a
.pdf file. The .html format
cannot be accepted since it may not be readable in
Gradescope. All proofs must be typed in LaTeX format. Make all of your
R code chunks visible for grading.
- Include your Name and NetID in the report.
- If you use this file or the example homework
.Rmd file
as a template, be sure to remove this instruction
section.
- On random seed and reproducibility: You should use
R version \(\geq 4.0.0\).
This ensures that your random number generation is the same as everyone
else's. Please note that updating the R version may require
you to reinstall all of your packages. In the markdown file, set
the seed properly so that the results can be reproduced (see the short
example after this list).
- For some questions, there will be restrictions on what
packages/functions you can use. Please read the requirements carefully.
As long as the question does not specify such restrictions, you can use
anything.
- Using AI tools: you should provide a brief
statement on the involvement of AI tools in your homework.
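Referring back to the random-seed bullet above, a minimal reproducibility chunk near the top of your .Rmd might look like the following (the seed value is only a placeholder):

```r
## Set a fixed seed once, near the top of the .Rmd, so results are reproducible
set.seed(546)      # placeholder value; any fixed integer works
R.version.string   # record the R version used (should be >= 4.0.0)
```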
Question 1 [40 pts]: Creating Your Own Tabular Dynamics
The following figure represents a simple tabular environment called
Frozen Lake (see the detailed explanation here).
This is an environment with 25 states (a \(5 \times 5\) grid) and 4 actions (up,
down, left, right). Mario starts at the top-left cell and hopes to
rescue Princess Toadstool, who is located at the bottom-right cell.
However, there are several holes and turtles on this lake that could
cause trouble. Our goal is to find the optimal policy for Mario to reach
the goal safely and quickly.
Here are some details for you to consider:
- You should start by defining the tabular dynamics for this
environment. For example, consider using a \(25 \times 1\) vector to represent the state
space, where each entry corresponds to a cell in the grid (from top-left
to bottom-right). You can also do this as a \(5 \times 5\) matrix if you prefer.
- For each state, if you take an action (up, down, left, right), you
will deterministically move to another state (or stay at the same state
if you attempt to move outside the grid; see the details
below).
- You will need to define a reward function for each (state, action)
pair:
- If Mario steps into an icy hole, the reward is -0.5. In the original
game, this will terminate the game, but for this exercise, we will allow
Mario to climb out of the hole and continue his journey.
- If Mario steps onto a cell with a turtle, he has to spend some time fighting it.
The reward is -1, but he will be able to continue his journey.
- If Mario attempts to step outside the grid, the
reward is -10. But he will stay at his original location (hence not
moving), and the game continues.
- When Mario moves to an empty cell, the reward is -0.1 due to
spending energy to walk, and the game continues.
- If Mario reaches Princess Toadstool, the reward is 10.
For such a deterministic tabular MDP, you do not need to simulate
data from the environment. Instead, you can directly implement value
iteration to find the optimal policy.
a). Please clearly define your state space, action space, transition
dynamics, and reward function. Implement the value iteration algorithm
to find the optimal value function, and clearly explain the logic of your code. Use a
discount factor of 0.9. Present the optimal value as a \(5 \times 5\) matrix, which is easier to
visualize.
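As a starting point, here is one possible sketch of how the dynamics could be encoded and value iteration run; it is not the required implementation, and the hole/turtle cell indices below are placeholders that must be replaced with the positions shown in the figure.

```r
n_side  <- 5
n_state <- n_side^2                       # states 1..25, top-left to bottom-right
actions <- c("up", "down", "left", "right")

hole_cells   <- c(7, 13)                  # placeholder indices; read off the figure
turtle_cells <- c(9, 18)                  # placeholder indices; read off the figure
goal_cell    <- n_state                   # bottom-right cell (Princess Toadstool)

to_rc <- function(s) c((s - 1) %/% n_side + 1, (s - 1) %% n_side + 1)
to_s  <- function(r, c) (r - 1) * n_side + c

## Reward for landing in s_next; `moved` is FALSE when Mario bumps into the wall
step_reward <- function(s_next, moved) {
  if (!moved)                   return(-10)
  if (s_next == goal_cell)      return(10)
  if (s_next %in% hole_cells)   return(-0.5)
  if (s_next %in% turtle_cells) return(-1)
  -0.1                                     # ordinary cell: energy cost of walking
}

## Deterministic transition and reward tables, both 25 x 4
trans  <- matrix(NA_real_, n_state, length(actions))
reward <- matrix(NA_real_, n_state, length(actions))
for (s in 1:n_state) {
  rc <- to_rc(s)
  for (a in seq_along(actions)) {
    d  <- switch(actions[a], up = c(-1, 0), down = c(1, 0),
                 left = c(0, -1), right = c(0, 1))
    r2 <- rc[1] + d[1]; c2 <- rc[2] + d[2]
    moved <- r2 >= 1 && r2 <= n_side && c2 >= 1 && c2 <= n_side
    s2 <- if (moved) to_s(r2, c2) else s
    trans[s, a]  <- s2
    reward[s, a] <- step_reward(s2, moved)
  }
}

## Value iteration with discount factor 0.9
gamma <- 0.9
V <- rep(0, n_state)
repeat {
  Q     <- reward + gamma * matrix(V[trans], n_state, length(actions))
  V_new <- apply(Q, 1, max)
  if (max(abs(V_new - V)) < 1e-8) { V <- V_new; break }
  V <- V_new
}
matrix(V, n_side, n_side, byrow = TRUE)    # optimal value displayed as 5 x 5
```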
b). After obtaining the optimal value, derive the optimal policy by
choosing the action that maximizes the one-step Bellman update \(r(s, a) + \gamma V^*(s')\) at each state.
Provide a visualization of the optimal policy on a \(5 \times 5\) matrix (e.g.,
using arrows or letters to indicate the optimal action at each
state).
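Continuing the sketch above (reusing its `trans`, `reward`, `gamma`, `actions`, and converged `V`), the greedy policy can be read off with an argmax over actions:

```r
## Greedy policy: argmax over actions of the one-step Bellman update
Q_star  <- reward + gamma * matrix(V[trans], n_state, length(actions))
pi_star <- actions[max.col(Q_star, ties.method = "first")]
matrix(substr(pi_star, 1, 1), n_side, n_side, byrow = TRUE)  # u/d/l/r letters
```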
Question 2 [40 pts]: Fitted Q-Iteration for Continuous Actions
In our lecture, we introduced the Fitted Q-Iteration (FQI) algorithm
for reinforcement learning with discrete action spaces (two treatment
options) and demonstrated that using the Cart-Pole example. In practice,
the problem can be extended to settings with continuous action space for
all force level between \([-1, 1]\).
Read our CartPole
simulator documentation carefully and perform the following
tasks:
a). First, we need to create a behavior policy that can generate data
with continuous action space. Please implement a behavior policy that
samples the action from a normal distribution with mean 0 and standard
deviation 0.5. If the generated action is outside the range of \([-1, 1]\), truncate it to be -1 or 1
accordingly. Provide some summary plots to show that your behavior
policy is working as expected.
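A minimal sketch of such a behavior policy is given below; it ignores the state entirely and simply clips a \(N(0, 0.5^2)\) draw to \([-1, 1]\). The function and object names are placeholders, not part of the simulator's interface.

```r
## Behavior policy: draw from N(0, 0.5^2), then clip to the action range [-1, 1]
behavior_policy <- function(state) {
  a <- rnorm(1, mean = 0, sd = 0.5)
  min(max(a, -1), 1)              # values below -1 or above 1 are truncated
}

## Quick summaries of the sampled actions as a sanity check
set.seed(1)
acts <- replicate(10000, behavior_policy(state = NULL))
summary(acts)
mean(abs(acts) == 1)              # fraction of truncated actions (about 4.6% expected)
hist(acts, breaks = 50, main = "Sampled behavior actions", xlab = "action")
```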
b). Next, we will use a fitted Q-iteration algorithm to learn an
optimal policy from the data generated by the behavior policy. Note that
the formula we showed in class only compares the Q-function values at
actions -1 and 1 and picks the maximum. You need to adapt
this formulation to the continuous action space. Several options are
possible. One simple way is to discretize the action space into several
bins and compare the Q-function values at those bins, although this can be
unstable for certain choices of model. Another way is to derive the
analytical form of the optimal action that maximizes the Q-function
given a state, if possible. This depends on the regression method you
choose to fit the Q-function (e.g., it is easier for a quadratic function
approximation, while for a model that is linear in the action the maximizer
simply falls on the boundary, which is not very meaningful). A minimal
sketch of the discretization approach appears after the list below. Do the
following:
- Clearly define and explain your fitted Q-iteration approach
- Implement the method with a discount factor of 0.9
- Show some convergence diagnostics (you may also define your
convergence criteria in the code) to demonstrate that your optimization
algorithm is working properly
- Provide evidence that your learned policy performs well when actually
deployed on the simulator
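For concreteness, here is one possible sketch of the discretized-action FQI update. It assumes the collected batch sits in a data frame `dat` with state columns `s1`–`s4`, an `action` column, a `reward` column, and next-state columns `s1_next`–`s4_next`; these names, the quadratic-in-action linear model, and the convergence tolerance are all placeholder choices, not requirements.

```r
gamma   <- 0.9
a_grid  <- seq(-1, 1, by = 0.1)               # candidate actions for the max step
s_cols  <- paste0("s", 1:4)                   # placeholder state column names
sp_cols <- paste0(s_cols, "_next")            # placeholder next-state column names

q_target <- dat$reward                        # iteration-0 target: immediate reward
fit_old  <- NULL
for (iter in 1:200) {
  ## Regression step: fit Q(s, a) to the current targets
  train <- data.frame(dat[s_cols], action = dat$action, y = q_target)
  fit   <- lm(y ~ (s1 + s2 + s3 + s4) * (action + I(action^2)), data = train)

  ## Max step: evaluate the fitted Q at next states over the candidate grid
  next_states <- setNames(dat[sp_cols], s_cols)
  q_next <- sapply(a_grid, function(a)
    predict(fit, newdata = transform(next_states, action = a)))
  q_target <- dat$reward + gamma * apply(q_next, 1, max)

  ## Convergence diagnostic: change in fitted Q-values between iterations
  if (!is.null(fit_old)) {
    delta <- max(abs(predict(fit, train) - predict(fit_old, train)))
    cat("iter", iter, "max change in fitted Q:", delta, "\n")
    if (delta < 1e-3) break
  }
  fit_old <- fit
}

## Greedy (learned) policy: best candidate action for a given state
greedy_action <- function(state) {            # `state` = named vector with s1..s4
  qs <- sapply(a_grid, function(a)
    predict(fit, newdata = data.frame(as.list(state), action = a)))
  a_grid[which.max(qs)]
}
```

The same loop works with other regression methods by swapping out the `lm()` call; only the max step and the convergence check need to be kept.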
Question 3 [80 pts]: The Expedia Hotel Search Data Mini Project
We will continue to use the Expedia Hotel Search Data from Kaggle.
However, this time, the data is further processed in the following
way:
- We treat each destination as a unique “individual”, and the searches
for that destination form a trajectory of multiple observations made for that
individual over time.
- Once an online user searches for a destination, Expedia will show a
list of hotels for that destination. This leads to multiple entries
(page views) in the original data for that search. However, for our
purpose, we summarize the information displayed to the user
on that page and call it \((S_t,
A_t)\).
- The \(S_t\) here includes some
information about the hotel, the user, the search context, and competitors.
- The \(A_t\) includes information
about how the website ranks the hotel list. Keep in mind that
Expedia wants to come up with the best ranking strategy to maximize
their revenue. Some variables we summarized are: the correlation between
the current ranking and the prices of the hotels, the correlation between
the current ranking and the review scores of the hotels, etc. These
variables can be viewed as a summary of the ranking strategy \(A_t\) used by Expedia for that search.
- Once a user finishes this entire browsing session, we will observe some
outcome \(R_t\) that indicates whether
the user clicked or booked a hotel, and how much was spent during
that session.
- Next time another user searches for the same destination, Expedia
will again show a list of hotels, potentially using a similar/different
ranking strategy \(A_{t+1}\) based on
the previous search results. This process continues for multiple
searches for the same destination to form a trajectory of observations
for that destination.
Our TA Mehrdad has once again prepared a processed version of the
data for us. This new format of the data is contained in the sub-folder
called “Expedia Data (condensed MDP)” on his GitHub
repository. Besides the processed data, you will find a file in that
sub-folder that explains the data processing idea and summarizes the
data. Please read it carefully before you
proceed.
This is an open-ended question. You are free to explore the data and
apply any offline RL methods you have learned in this course to learn an
optimal policy for Expedia to rank their hotel list. However, you need
to follow these steps:
- Start with exploratory data analysis to understand the data
structure, distributions, etc., so that you can make informed decisions in
the following steps.
- Among all the variables available in the data, you need to pick
one reward variable out of these candidates:
total_clicks, total_gross_revenue,
gross_revenue_per_night to be your \(R_t\). Note that some of your following
analysis will depend on your choice of the reward variable, so please
explain your choice clearly. Once you pick your reward variable, the
other two candidates should be dropped from the data, since they cannot
be used as either state or action.
- Next, among all the variables available in the data, you need to
pick one variable to be your action variable \(A_t\). Here you have many choices, for
example,
corr_pos_price, corr_pos_review,
total_promotions, etc. Once you decide on your action
variable, you will keep the rest of the candidates as
state variables.
- Now you have a dataset with many state variables, one action
variable, and one reward variable. We will treat this as an MDP problem
and fit RL models (a minimal sketch of this data-preparation step is
given after this list).
- Starting from here, you can apply any offline RL methods you have
learned in this course to learn an optimal policy. You are not required
to validate it on testing data. However, you should demonstrate
these aspects of your model:
- You should clearly explain your approach, your reasoning, and your
choices at every step of the way. This includes all details such as
hyper-parameter tuning, model selection, convergence diagnostics,
etc.
- Among the high-dimensional state variables, you should perform some
variable selection or dimension reduction to identify the most important
state variables that affect the Q-function and the optimal policy.
- After obtaining the final model, you should provide some
interpretation of the results via tables and figures. For example, you
may visualize how the optimal action changes with respect to some state
variables, and explain what that means in practice. Note that the signals in
this dataset could be relatively weak, so you may need to be careful
in your analysis and interpretation.
- You should also discuss the limitations of your analysis and
potential improvements that can be made in the future.
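As a minimal illustration of the reward/action choice and the data preparation described in the list above, the sketch below assumes the processed data has been read into a data frame `expedia`; the chosen reward (`total_clicks`) and action (`corr_pos_price`) are placeholder choices that you should justify and may well replace.

```r
library(dplyr)

## Placeholder choices: reward = total_clicks, action = corr_pos_price.
## The two unused reward candidates are dropped; everything that is neither
## the reward nor the action is kept as the state S_t.
mdp_dat <- expedia %>%
  select(-total_gross_revenue, -gross_revenue_per_night) %>%
  rename(R = total_clicks, A = corr_pos_price)
```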