Stat 546 Homework 5

Instruction
Homework Description
Question 1: Validity of Causal Inference (20 points)
Question 2: The X-learner and DR-learner (50 points)
Question 3: Individualized Decision Making (50 points)

Instruction

Students are encouraged to work together on homework. However, sharing, copying, or providing any part of a homework solution or code is an infraction of the University’s rules on Academic Integrity. Any violation will be punished as severely as possible. Final submissions must be uploaded to Gradescope. No email or hard copy will be accepted. For late submission policy and grading rubrics, please refer to the course website.

- You are required to submit the rendered file HWx_yourNetID.pdf, for example, HW01_rqzhu.pdf. Please note that this must be a .pdf file; the .html format cannot be accepted since it may not be readable in Gradescope. All proofs must be typed in LaTeX format. Make all of your R code chunks visible for grading.
- Include your Name and NetID in the report.
- If you use this file or the example homework .Rmd file as a template, be sure to remove this instruction section.
- On random seed and reproducibility: You should use R version \(\geq 4.0.0\). This will ensure your random seed generation is the same as everyone else's. Please note that updating the R version may require you to reinstall all of your packages. In the markdown file, set the seed properly so that the results can be reproduced.
- For some questions, there will be restrictions on what packages/functions you can use. Please read the requirements carefully. As long as the question does not specify such restrictions, you can use anything.
- Using AI tools: you should provide a brief statement on the involvement of AI tools in your homework.

Homework Description

This homework contains a mini project on individualized decision making and conditional average treatment effect estimation. We will use a real dataset and apply the methods learned in class to analyze the data and draw conclusions. Some parts of this homework are open ended. You are encouraged to explore and be creative. Some questions may also be computationally intensive, depending on your choices of methods and parameters. Hence you may decide your approach accordingly, as long as you can answer the questions and justify your choices.

The dataset we will be using comes from Kaggle Personalize Expedia Hotel Searches. Please be careful that our research goal is different from the competition goal, and you should read the description of the data variables carefully before you proceed. Here is what we want to do:

Our TA Mehrdad has prepared a processed version of the data for us, which can be found at his GitHub repository here. The original data is more than 1 GB, so we provide a small subset of it, constructed and set up as follows:
- We sampled 600 unique srch_destination_id from the original training data, and for each destination id, we sampled one prop_id (hotel) and included all observations for that (destination \(\times\) hotel) combination. This is to avoid interference as much as possible since multiple hotels of the same search are naturally competing against each other, making some of our assumptions invalid.
- For this subset of the data, we want to consider the binary variable promotion_flag as the treatment variable.
- Continuous variables are already scaled in the data, hence you do not need to do any variable transformation for them.
- There are three outcome variables we can consider: click_bool, gross_bookings_usd, and booking_bool, which indicate the results of the search.
- The goal of our homework is to understand the causal effect of promotion_flag.
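
The sampling scheme described in the first bullet can be sketched as follows. This is only an illustration on a simulated stand-in for the raw search log, not the TA's actual preprocessing script: the data frame `raw`, the helper `pick_one`, and the toy sizes are all assumptions made for the example.

```r
# Toy stand-in for the raw search log; column names mirror the real data,
# but the values are simulated for illustration only.
set.seed(1)
raw <- data.frame(
  srch_destination_id = sample(1:50, 2000, replace = TRUE),
  prop_id             = sample(1:300, 2000, replace = TRUE)
)

n_dest <- 10  # the homework subset used 600; 10 keeps the toy example small

# Step 1: sample destination ids without replacement.
dest_ids <- sample(unique(raw$srch_destination_id), n_dest)

# Step 2: for each sampled destination, pick one hotel seen at that destination.
# pick_one (hypothetical helper) avoids sample()'s special case for length-one input.
pick_one <- function(x) x[sample.int(length(x), 1)]
pairs <- data.frame(
  srch_destination_id = dest_ids,
  prop_id = sapply(dest_ids, function(d)
    pick_one(unique(raw$prop_id[raw$srch_destination_id == d])))
)

# Step 3: keep every row belonging to a sampled (destination, hotel) pair.
subset_data <- merge(raw, pairs, by = c("srch_destination_id", "prop_id"))

length(unique(subset_data$srch_destination_id))
```

Because each destination contributes exactly one hotel, no two rows in `subset_data` come from competing hotels within the same destination.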

The following code loads the data.

  traindata = read.csv("expedia_train_rl.csv")
  colnames(traindata)

##  [1] "prop_id"                   "srch_destination_id"      
##  [3] "time_idx"                  "srch_length_of_stay"      
##  [5] "srch_room_count"           "srch_saturday_night_bool" 
##  [7] "prop_location_score1"      "prop_location_score2"     
##  [9] "prop_log_historical_price" "prop_review_score"        
## [11] "prop_starrating"           "random_bool"              
## [13] "position"                  "price_usd"                
## [15] "promotion_flag"            "gross_bookings_usd"       
## [17] "booking_bool"              "click_bool"               
## [19] "comp_rate"                 "comp_inv"

  length(unique(traindata$srch_destination_id))

## [1] 600

  length(unique(traindata$prop_id))

## [1] 587

  testdata = read.csv("expedia_test_rl.csv")
  length(unique(testdata$srch_destination_id))

## [1] 200

  length(unique(testdata$prop_id))

## [1] 199

  # difference in means estimator
  mean(traindata$gross_bookings_usd[traindata$promotion_flag==1]) - 
    mean(traindata$gross_bookings_usd[traindata$promotion_flag==0])

## [1] 0.1007867

  mean(testdata$gross_bookings_usd[testdata$promotion_flag==1]) - 
    mean(testdata$gross_bookings_usd[testdata$promotion_flag==0])

## [1] 0.01286095

Question 1: Validity of Causal Inference (20 points)

We have learned many assumptions related to causal inference and conditional average treatment effects in class. After carefully reading the documentation of this data, do you think this dataset satisfies the assumptions we need? Please discuss each assumption that you think is relevant to our problem (CATE estimation and individualized decision making), and whether you think the assumption is likely to hold or not. You should provide justifications for your answers. Keep in mind that regardless of the validity of the assumptions, we will still proceed to analyze the data in later questions, and moving forward, we will treat each row of this data as an independent observation.
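
One assumption commonly discussed in this setting is overlap (positivity): every covariate profile should have a nonzero chance of receiving either treatment. A minimal sketch of an empirical check is given below, on simulated data rather than the Expedia data. The variable names `x1`, `x2`, `w`, the logistic propensity model, and the 0.05/0.95 cutoffs are all assumptions made for illustration.

```r
# Simulated stand-in data; x1, x2, and w are hypothetical names.
set.seed(546)
n  <- 2000
x1 <- rnorm(n)
x2 <- rnorm(n)
w  <- rbinom(n, 1, plogis(0.5 * x1 - 0.25 * x2))  # treatment depends on covariates

# Estimate the propensity score e(x) = P(W = 1 | X = x) by logistic regression.
ps_fit <- glm(w ~ x1 + x2, family = binomial)
e_hat  <- fitted(ps_fit)

# Overlap diagnostic: compare the score distribution across arms and
# report the share of units with extreme scores (cutoffs are arbitrary).
tapply(e_hat, w, summary)
mean(e_hat < 0.05 | e_hat > 0.95)
```

A diagnostic like this can only speak to overlap on observed covariates; it says nothing about unmeasured confounding or interference, which need to be argued from the data documentation.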

Question 2: The X-learner and DR-learner (50 points)

One of the easiest ways to estimate CATE is the X-learner. Please implement the X-learner to estimate the CATE of promotion_flag on gross_bookings_usd.

- You need to implement at least two methods: one linear model (any type) and one nonparametric model (any type), for the regression steps in the X-learner. You should provide a justification of your choices of methods.
- You may need to do some data preprocessing. For example, not all variables are relevant for the analysis. Only include the ones that are meaningful. Provide a justification for the variables you removed.
- How are you treating the discrete and continuous variables in your models? Nominal or ordinal? Please explain your choices.
- Once you have the conditional average treatment effects (CATE) estimated for each observation, provide a summary for the average treatment effect (ATE) and discuss if the treatment is beneficial on average.
- The DR-learner is another popular method for CATE estimation. Please also implement the DR-learner (see algorithm 1, page 3019 of the original paper) to estimate the CATE and then summarize the ATE for this sampled population. Discuss briefly how the DR-learner is different from the X-learner and why it is able to achieve some additional theoretical properties. What could be the drawbacks of the DR-learner compared to the X-learner in practice?
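
The two learners named above can be sketched as follows on simulated data with a single covariate. This is an illustration under stated assumptions, not a reference solution: the data-generating process, the use of `lm` for every regression step, the propensity score as the X-learner weight, and the two-fold cross-fitting scheme are all choices made for this example, and the algorithm in the cited paper may prescribe a different sample-splitting scheme.

```r
set.seed(546)
n   <- 4000
x   <- rnorm(n)
w   <- rbinom(n, 1, plogis(0.5 * x))
tau <- 1 + 0.5 * x                        # true CATE in this simulation
y   <- 2 * x + w * tau + rnorm(n)
dat <- data.frame(x = x, w = w, y = y)

## ---- X-learner with linear base learners ----
trt <- dat[dat$w == 1, ]
ctl <- dat[dat$w == 0, ]
mu1 <- lm(y ~ x, data = trt)              # outcome model, treated arm
mu0 <- lm(y ~ x, data = ctl)              # outcome model, control arm

# Imputed individual effects, each using the other arm's outcome model.
trt$d <- trt$y - predict(mu0, newdata = trt)
ctl$d <- predict(mu1, newdata = ctl) - ctl$y
tau1 <- lm(d ~ x, data = trt)
tau0 <- lm(d ~ x, data = ctl)

# Combine the two CATE models, weighting by the estimated propensity score.
e_hat  <- fitted(glm(w ~ x, family = binomial, data = dat))
cate_x <- e_hat * predict(tau0, newdata = dat) +
  (1 - e_hat) * predict(tau1, newdata = dat)

## ---- DR-learner with two-fold cross-fitting ----
fold    <- sample(rep(1:2, length.out = n))
cate_dr <- numeric(n)
for (k in 1:2) {
  nuis <- dat[fold != k, ]                # fold used for nuisance estimation
  main <- dat[fold == k, ]                # fold used for the second-stage regression
  e_k  <- predict(glm(w ~ x, family = binomial, data = nuis),
                  newdata = main, type = "response")
  m1_k <- predict(lm(y ~ x, data = nuis[nuis$w == 1, ]), newdata = main)
  m0_k <- predict(lm(y ~ x, data = nuis[nuis$w == 0, ]), newdata = main)
  mu_w <- ifelse(main$w == 1, m1_k, m0_k)
  # Doubly robust pseudo-outcome, then regress it on the covariate.
  main$phi <- (main$w - e_k) / (e_k * (1 - e_k)) * (main$y - mu_w) + m1_k - m0_k
  fit_k <- lm(phi ~ x, data = main)
  cate_dr <- cate_dr + predict(fit_k, newdata = dat) / 2  # average the two fits
}

# ATE summaries: average the estimated CATEs over the sample.
c(truth = mean(tau), x_learner = mean(cate_x), dr_learner = mean(cate_dr))
```

Replacing `lm` with a nonparametric learner changes only the fitting calls; the imputation, pseudo-outcome, and averaging steps stay the same.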

Question 3: Individualized Decision Making (50 points)

We have learned several other methods for estimating the CATE. Please implement one of them (other than the X-learner and DR-learner) to estimate the CATE of promotion_flag on booking_bool (a binary outcome). Keep in mind that the goal of this question is to understand which variables are important in deciding whether or not to provide the promotion. Complete the following:

- You should use almost the same set of data you used in the previous question. Do not include the outcome variable gross_bookings_usd or click_bool in your analysis. If you choose to do some additional data preprocessing, please provide justifications for your choices.
- Implement the method you chose to estimate the CATE. You should provide a brief description of the method and how you choose the tuning parameters (if any).
- Now, we will utilize the testing data. Keep in mind that each testing observation also only contains the result for one treatment arm (either promotion or no promotion). Hence, we cannot directly evaluate the effect of treatment on each testing observation. However, if we can estimate the outcome models (make a good choice) on the testing data, we may still evaluate the performance of our CATE estimates indirectly. Carry out this idea (provide a clear description of your implementation) to evaluate if your estimated CATE matches the estimated CATE on the testing data. Provide figures or tables to summarize and discuss your analysis results.
- A researcher is wondering which variables are important for the decision making. Based on your analysis, would you recommend that Expedia offer promotions universally or selectively? To answer this question, please provide some additional investigation to see what type of hotels (a potential subset, if one exists) would be likely to benefit most from the promotion. Again, how could you validate your findings using the testing data? Please provide figures or tables to summarize and discuss your results.
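
The indirect evaluation idea above can be sketched as follows, again on simulated data. Everything here is an assumption made for illustration: the `simulate` and `t_learner` helpers are hypothetical, a T-learner with logistic outcome models stands in for whichever CATE method is chosen, and quartile grouping is just one possible way to summarize agreement.

```r
set.seed(546)
# Hypothetical data generator: binary outcome, randomized binary treatment.
simulate <- function(n) {
  x <- rnorm(n)
  w <- rbinom(n, 1, 0.5)
  y <- rbinom(n, 1, plogis(-1 + 0.5 * x + w * (0.3 + 0.8 * x)))
  data.frame(x = x, w = w, y = y)
}
train <- simulate(5000)
test  <- simulate(5000)

# T-learner: one logistic outcome model per arm; the CATE estimate is the
# difference in predicted outcome probabilities.
t_learner <- function(d) {
  m1 <- glm(y ~ x, family = binomial, data = d[d$w == 1, ])
  m0 <- glm(y ~ x, family = binomial, data = d[d$w == 0, ])
  function(newdata) {
    predict(m1, newdata = newdata, type = "response") -
      predict(m0, newdata = newdata, type = "response")
  }
}

cate_train_fit <- t_learner(train)   # the estimate being evaluated
cate_test_fit  <- t_learner(test)    # reference built only from the testing data

tau_from_train <- cate_train_fit(test)
tau_from_test  <- cate_test_fit(test)

# Agreement between the two sets of CATE estimates at the testing points.
cor(tau_from_train, tau_from_test)

# Mean reference CATE within quartiles of the training-based estimate.
grp <- cut(tau_from_train,
           breaks = quantile(tau_from_train, probs = seq(0, 1, 0.25)),
           include.lowest = TRUE, labels = paste0("Q", 1:4))
quartile_means <- tapply(tau_from_test, grp, mean)
quartile_means
```

If the training-based estimates carry real signal, the reference CATE should increase from Q1 to Q4. The same grouping can be used to check whether a subgroup flagged as benefiting most in training also shows a larger effect in the testing data.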