Students are encouraged to work together on homework. However, sharing, copying, or providing any part of a homework solution or code is an infraction of the University’s rules on Academic Integrity. Any violation will be punished as severely as possible. Final submissions must be uploaded to Gradescope. No email or hard copy will be accepted. For late submission policy and grading rubrics, please refer to the course website.
Name your submission file HWx_yourNetID.pdf. For example, HW01_rqzhu.pdf. Please note that this must be a .pdf file; .html files cannot be accepted since they may not be readable in Gradescope.
All proofs must be typed in LaTeX format.
Make all of your R code chunks visible for grading.
If you use this .Rmd file as a template, be sure to remove this instruction section.
Make sure your R version is \(\geq 4.0.0\). This will ensure your random seed generation is the same as everyone else's. Please note that updating the R version may require you to reinstall all of your packages. In the markdown file, set the seed properly so that the results can be reproduced.
This homework contains a mini project on individualized decision making and conditional average treatment effect (CATE) estimation. We will use a real dataset and apply the methods learned in class to analyze the data and draw conclusions. Some parts of this homework are open ended, and you are encouraged to explore and be creative. Some questions may also be computationally intensive, depending on your choice of methods and parameters. You may choose your approach accordingly, as long as you can answer the questions and justify your choices.
The dataset we will be using comes from the Kaggle competition Personalize Expedia Hotel Searches. Please note that our research goal is different from the competition goal, and you should read the description of the data variables carefully before you proceed. Here is what we want to do:
We sampled srch_destination_id values from the original training data, and for each destination id, we sampled one prop_id (hotel) and included all observations for that (destination \(\times\) hotel) combination. This is to avoid interference as much as possible, since multiple hotels in the same search naturally compete against each other, which would make some of our assumptions invalid.
We use promotion_flag as the treatment variable.
The outcome variables are click_bool, gross_bookings_usd, and booking_bool, which indicate the results of the search.
[...] promotion_flag
The following code loads the data.
traindata = read.csv("expedia_train_rl.csv")
colnames(traindata)
## [1] "prop_id" "srch_destination_id"
## [3] "time_idx" "srch_length_of_stay"
## [5] "srch_room_count" "srch_saturday_night_bool"
## [7] "prop_location_score1" "prop_location_score2"
## [9] "prop_log_historical_price" "prop_review_score"
## [11] "prop_starrating" "random_bool"
## [13] "position" "price_usd"
## [15] "promotion_flag" "gross_bookings_usd"
## [17] "booking_bool" "click_bool"
## [19] "comp_rate" "comp_inv"
length(unique(traindata$srch_destination_id))
## [1] 600
length(unique(traindata$prop_id))
## [1] 587
testdata = read.csv("expedia_test_rl.csv")
length(unique(testdata$srch_destination_id))
## [1] 200
length(unique(testdata$prop_id))
## [1] 199
# difference in means estimator
mean(traindata$gross_bookings_usd[traindata$promotion_flag==1]) -
mean(traindata$gross_bookings_usd[traindata$promotion_flag==0])
## [1] 0.1007867
mean(testdata$gross_bookings_usd[testdata$promotion_flag==1]) -
mean(testdata$gross_bookings_usd[testdata$promotion_flag==0])
## [1] 0.01286095
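The difference-in-means values above are point estimates only. As a minimal sketch of how an uncertainty statement could be attached, the block below computes the same estimator with a Welch confidence interval. It runs on simulated data; the data frame `sim` and all of its values are made up for illustration, and only the column names mirror the real dataset.

```r
# Simulated stand-in for traindata (values are invented, not from the real data)
set.seed(1)
n <- 2000
sim <- data.frame(promotion_flag = rbinom(n, 1, 0.3))
sim$gross_bookings_usd <- 5 + 0.5 * sim$promotion_flag + rnorm(n, sd = 10)

treated <- sim$gross_bookings_usd[sim$promotion_flag == 1]
control <- sim$gross_bookings_usd[sim$promotion_flag == 0]

# Difference-in-means estimator, as in the code above
dim_est <- mean(treated) - mean(control)

# Welch two-sample t-test (t.test uses unequal variances by default)
welch <- t.test(treated, control)

dim_est
welch$conf.int
```

The interval is only meaningful as a causal statement if treatment is as good as randomly assigned, which is one of the assumptions under discussion in this homework.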
We have learned many assumptions related to causal inference and conditional average treatment effects in class. After carefully reading the documentation of this data, do you think this dataset satisfies the assumptions we need? Please discuss each assumption that you think is relevant to our problem (CATE estimation and individualized decision making), and whether you think it is likely to hold. You should justify your answers. Keep in mind that regardless of the validity of the assumptions, we will still proceed to analyze the data in later questions, and moving forward, we will treat each row of this data as an independent observation.
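One assumption that can be partly examined from data is overlap (positivity). The sketch below, on simulated data, fits a logistic propensity model and compares the estimated scores across treatment groups. The treatment mechanism, the choice of covariates, and the model form are all assumptions made for illustration, not a prescribed answer.

```r
# Simulated covariates (hypothetical values; column names mirror the dataset)
set.seed(2)
n <- 1000
sim <- data.frame(
  prop_starrating = sample(1:5, n, replace = TRUE),
  price_usd = rlnorm(n, meanlog = 5, sdlog = 0.5)
)

# Hypothetical treatment mechanism: higher-starred hotels promote more often
p_true <- plogis(-1.5 + 0.3 * sim$prop_starrating)
sim$promotion_flag <- rbinom(n, 1, p_true)

# Logistic regression propensity model
ps_fit <- glm(promotion_flag ~ prop_starrating + log(price_usd),
              data = sim, family = binomial())
ps_hat <- fitted(ps_fit)  # estimated P(promotion_flag = 1 | X)

# Overlap is more plausible when both groups cover a similar range of
# scores and neither group piles up near 0 or 1
tapply(ps_hat, sim$promotion_flag, summary)
```

A check like this cannot verify unconfoundedness or the absence of interference; those still require arguments based on how the data were collected.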
One of the easiest ways to estimate CATE is the X-learner. Please
implement the X-learner to estimate the CATE of
promotion_flag on gross_bookings_usd.
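For orientation, here is a minimal sketch of the X-learner steps on simulated data, using linear regression as the base learner and a logistic propensity score as the weighting function. The variables `x1`, `x2`, `a`, and `y` are hypothetical, and the linear base learners are an assumption chosen for brevity; any regression method could be substituted.

```r
set.seed(3)
n <- 2000
x1 <- rnorm(n)
x2 <- rnorm(n)
e <- plogis(0.5 * x1)          # true propensity score (simulation only)
a <- rbinom(n, 1, e)           # treatment indicator
tau_true <- 1 + 0.5 * x2       # true CATE (simulation only)
y <- 2 + x1 + a * tau_true + rnorm(n)
dat <- data.frame(x1, x2, a, y)

trt <- dat[dat$a == 1, ]
ctl <- dat[dat$a == 0, ]

# Step 1: outcome models fitted separately in each arm
mu1 <- lm(y ~ x1 + x2, data = trt)
mu0 <- lm(y ~ x1 + x2, data = ctl)

# Step 2: imputed individual treatment effects
trt$d <- trt$y - predict(mu0, newdata = trt)
ctl$d <- predict(mu1, newdata = ctl) - ctl$y

# Step 3: regress the imputed effects on covariates within each arm
tau1 <- lm(d ~ x1 + x2, data = trt)
tau0 <- lm(d ~ x1 + x2, data = ctl)

# Step 4: combine the two CATE models, weighting by the propensity score
g <- fitted(glm(a ~ x1 + x2, data = dat, family = binomial()))
tau_hat <- g * predict(tau0, newdata = dat) +
  (1 - g) * predict(tau1, newdata = dat)

# Agreement with the known truth; only possible because the data are simulated
cor(tau_hat, tau_true)
```

On the real data the true CATE is unknown, so an evaluation like the last line is not available and the estimate has to be assessed by other means.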
We have learned several other methods for estimating the CATE. Please
implement one of them (other than the X-learner and DR-learner) to
estimate the CATE of promotion_flag on
booking_bool (a binary outcome). Keep in mind that the goal
of this question is to understand which variables are important for
the decision of whether to provide the promotion. Complete
the following:
gross_bookings_usd or click_bool in your
analysis. If you choose to do some additional data preprocessing, please
provide justifications for your choices.
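As one possible illustration for a binary outcome, the sketch below uses a T-learner with logistic regression on simulated data. Whether the T-learner is among the methods covered in class is an assumption here, and the data-generating process, covariate choice, and decision rule are hypothetical.

```r
set.seed(4)
n <- 4000
sim <- data.frame(
  prop_starrating = sample(1:5, n, replace = TRUE),
  prop_review_score = runif(n, 1, 5)
)
sim$promotion_flag <- rbinom(n, 1, 0.4)

# Hypothetical truth: promotion helps more for higher-starred hotels
p_book <- plogis(-2 + 0.2 * sim$prop_review_score +
  sim$promotion_flag * (0.3 * sim$prop_starrating - 0.5))
sim$booking_bool <- rbinom(n, 1, p_book)

# T-learner: one outcome model per treatment arm
f <- booking_bool ~ prop_starrating + prop_review_score
fit1 <- glm(f, data = sim[sim$promotion_flag == 1, ], family = binomial())
fit0 <- glm(f, data = sim[sim$promotion_flag == 0, ], family = binomial())

# Estimated CATE on the probability scale
cate_hat <- predict(fit1, newdata = sim, type = "response") -
  predict(fit0, newdata = sim, type = "response")

# Simple decision rule: promote when the estimated effect is positive
rule <- as.integer(cate_hat > 0)

# Crude summary of which covariates drive the estimated effect
coef(lm(cate_hat ~ prop_starrating + prop_review_score, data = sim))
```

The final regression is only a rough descriptive device; variable importance from a more flexible learner may tell a different story.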