Instructions

Students are encouraged to work together on homework. However, sharing, copying, or providing any part of a homework solution or code is an infraction of the University’s rules on Academic Integrity. Any violation will be punished as severely as possible. Final submissions must be uploaded to Gradescope. Email and hard-copy submissions will not be accepted. For the late submission policy and grading rubrics, please refer to the course website.

Homework Description

This homework contains a mini project on individualized decision making and conditional average treatment effect (CATE) estimation. We will use a real dataset and apply the methods learned in class to analyze the data and draw conclusions. Some parts of this homework are open-ended, and you are encouraged to explore and be creative. Some questions may also be computationally intensive, depending on your choice of methods and parameters. You may therefore adapt your approach accordingly, as long as you can answer the questions and justify your choices.

The dataset we will be using comes from the Kaggle competition Personalize Expedia Hotel Searches. Please note that our research goal is different from the competition goal, and you should read the description of the data variables carefully before you proceed. Here is what we want to do:

The following code loads the data.

  traindata = read.csv("expedia_train_rl.csv")
  colnames(traindata)
##  [1] "prop_id"                   "srch_destination_id"      
##  [3] "time_idx"                  "srch_length_of_stay"      
##  [5] "srch_room_count"           "srch_saturday_night_bool" 
##  [7] "prop_location_score1"      "prop_location_score2"     
##  [9] "prop_log_historical_price" "prop_review_score"        
## [11] "prop_starrating"           "random_bool"              
## [13] "position"                  "price_usd"                
## [15] "promotion_flag"            "gross_bookings_usd"       
## [17] "booking_bool"              "click_bool"               
## [19] "comp_rate"                 "comp_inv"
  length(unique(traindata$srch_destination_id))
## [1] 600
  length(unique(traindata$prop_id))
## [1] 587
  testdata = read.csv("expedia_test_rl.csv")
  length(unique(testdata$srch_destination_id))
## [1] 200
  length(unique(testdata$prop_id))
## [1] 199
  # difference in means estimator
  mean(traindata$gross_bookings_usd[traindata$promotion_flag==1]) - 
    mean(traindata$gross_bookings_usd[traindata$promotion_flag==0])
## [1] 0.1007867
  mean(testdata$gross_bookings_usd[testdata$promotion_flag==1]) - 
    mean(testdata$gross_bookings_usd[testdata$promotion_flag==0])
## [1] 0.01286095
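
The difference in means above is a point estimate only. As a rough illustration of how its uncertainty could be quantified, the sketch below adds an unpooled (Welch-type) standard error and a normal-approximation 95% interval. The helper name `diff_in_means` and the toy vectors are hypothetical and not part of the homework data; the same function could be applied to `traindata$gross_bookings_usd` and `traindata$promotion_flag`, assuming rows are treated as independent.

```r
# Hypothetical helper: difference in means with an unpooled standard error.
# y is the outcome vector, w is a 0/1 treatment indicator.
diff_in_means <- function(y, w) {
  y1 <- y[w == 1]
  y0 <- y[w == 0]
  est <- mean(y1) - mean(y0)
  # Variances are estimated separately in each arm (no equal-variance assumption)
  se <- sqrt(var(y1) / length(y1) + var(y0) / length(y0))
  c(estimate = est, se = se,
    lower = est - 1.96 * se, upper = est + 1.96 * se)
}

# Toy data for illustration only
toy_y <- c(1, 2, 3, 4, 5, 6)
toy_w <- c(0, 0, 0, 1, 1, 1)
res <- diff_in_means(toy_y, toy_w)
print(res)
```

This interval is only meaningful as a causal statement if the treatment is as good as randomly assigned, which is exactly what Question 1 asks you to examine.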

Question 1: Validity of Causal Inference (20 points)

We have learned many assumptions related to causal inference and conditional average treatment effects in class. After carefully reading the documentation of this data, do you think this dataset satisfies the assumptions we need? Please discuss each assumption that you think is relevant to our problem (CATE estimation and individualized decision making), and whether you think the assumption is likely to hold or not. You should provide justifications for your answers. Keep in mind that regardless of the validity of the assumptions, we will still proceed to analyze the data in later questions, and moving forward, we will treat each row of this data as an independent observation.

Question 2: The X-learner and DR-learner (50 points)

One of the easiest ways to estimate CATE is the X-learner. Please implement the X-learner to estimate the CATE of promotion_flag on gross_bookings_usd.
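
For orientation, the steps of the X-learner can be sketched on simulated data as follows. This is a minimal sketch under stated assumptions, not a prescribed solution: the data are simulated rather than taken from the Expedia files, the base learners are plain linear models (`lm`), the propensity score comes from a logistic regression (`glm`), and all variable names (`x1`, `x2`, `w`, `y`, `tau_hat`) are hypothetical. In the homework, the outcome would be `gross_bookings_usd`, the treatment `promotion_flag`, and you are free to use more flexible base learners.

```r
set.seed(1)

# Simulated data with a known CATE, for illustration only
n  <- 2000
x1 <- rnorm(n)
x2 <- rnorm(n)
w  <- rbinom(n, 1, plogis(0.5 * x1))   # treatment depends on x1
tau_true <- 1 + 0.5 * x2               # true CATE
y  <- x1 + x2 + w * tau_true + rnorm(n)
dat <- data.frame(y, w, x1, x2)

treated <- dat[dat$w == 1, ]
control <- dat[dat$w == 0, ]

# Step 1: fit outcome models separately in each arm
mu1 <- lm(y ~ x1 + x2, data = treated)
mu0 <- lm(y ~ x1 + x2, data = control)

# Step 2: impute individual treatment effects using the other arm's model
treated$d <- treated$y - predict(mu0, newdata = treated)
control$d <- predict(mu1, newdata = control) - control$y

# Step 3: regress the imputed effects on the covariates in each arm
tau1 <- lm(d ~ x1 + x2, data = treated)
tau0 <- lm(d ~ x1 + x2, data = control)

# Step 4: combine the two CATE estimates, weighting by the propensity score
ps    <- glm(w ~ x1 + x2, family = binomial, data = dat)
e_hat <- predict(ps, newdata = dat, type = "response")
tau_hat <- e_hat * predict(tau0, newdata = dat) +
  (1 - e_hat) * predict(tau1, newdata = dat)

# Compare with the known truth (possible only because the data are simulated)
print(cor(tau_hat, tau_true))
print(mean(tau_hat))
```

Linear base learners are used here only because the simulated outcome is linear. On real data, the quality of the CATE estimate depends heavily on how well the base learners fit each arm.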

Question 3: Individualized Decision Making (50 points)

We have learned several other methods for estimating the CATE. Please implement one of them (other than the X-learner and DR-learner) to estimate the CATE of promotion_flag on booking_bool (a binary outcome). Keep in mind that the goal of this question is to understand which variables are important in deciding whether or not to provide the promotion. Complete the following: