STAT 546 Homework 2

Instruction

Please remove this section when submitting your homework.

Students are encouraged to work together on homework and/or utilize advanced AI tools. However, sharing, copying, or providing any part of a homework solution or code to others is an infraction of the University’s rules on Academic Integrity. Any violation will be punished as severely as possible. Final submissions must be uploaded to Gradescope. No email or hard copy will be accepted. For late submission policy and grading rubrics, please refer to the course website.

You are required to submit the rendered file HWx_yourNetID.pdf. For example, HW01_rqzhu.pdf. Please note that this must be a .pdf file. .html format cannot be accepted. Make all of your R code chunks visible for grading.
Include your Name and NetID in the report.
If you use this file or the example homework .Rmd file as a template, be sure to remove this instruction section.
Make sure that you set seed properly so that the results can be replicated if needed.
For some questions, there will be restrictions on what packages/functions you can use. Please read the requirements carefully. As long as the question does not specify such restrictions, you can use anything.
When using AI tools, you are encouraged to document your experience with them, especially when they have difficulty grasping the idea of the question.
On random seed and reproducibility: Make sure the version of your R is \(\geq 4.0.0\). This will ensure your random seed generation is the same as everyone else's. Please note that updating the R version may require you to re-install all of your packages.

Question 1: Doubly Robust Estimation and Bootstrap

For this problem, we use the provided iv_health.csv data. The dataset is used to study the effect of having health insurance on medical expenses. We are only interested in a particular set of variables, so load the data and restrict the dataset to the variables listed below. In addition, the ssiratio variable should take values between 0 and 1, which is not true for some entries in the dataset. Correct this by setting the values that are greater than 1 to 1.

Description of Variables
Variable	Description
logmedexpense	Medical Expenses (log form)
healthinsu	Having health insurance (=1) or not (=0)
age	Age
logincome	Annual Income (log form)
illnesses	Severity of Illnesses
ssiratio	Ratio of Social Security Income
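
The preprocessing described above could be sketched as follows. This is only an illustration: the helper name `prepare_health`, the toy data frame, and the file path in the final comment are assumptions, not part of the assignment.

```r
# Hypothetical helper: keep the six variables listed in the table above
# and cap ssiratio at 1.
prepare_health <- function(df) {
  keep <- c("logmedexpense", "healthinsu", "age",
            "logincome", "illnesses", "ssiratio")
  out <- df[, keep]
  out$ssiratio <- pmin(out$ssiratio, 1)  # values above 1 become exactly 1
  out
}

# Toy data frame standing in for the real file (all values are made up).
toy <- data.frame(
  logmedexpense = c(6.1, 7.3, 5.8),
  healthinsu    = c(1, 0, 1),
  age           = c(66, 71, 80),
  logincome     = c(2.9, 3.4, 2.1),
  illnesses     = c(1, 3, 2),
  ssiratio      = c(0.4, 1.7, 1.0),
  extra_column  = c("a", "b", "c")
)
toy_clean <- prepare_health(toy)

# With the real data (the path is an assumption):
# health <- prepare_health(read.csv("iv_health.csv"))
```

`pmin()` is used here because it caps the vector elementwise and leaves values already at or below 1 unchanged.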

This dataset was originally from the Medical Expenditure Panel Survey and was used as an example here. However, for our purpose, we will assume that there are no unobserved confounders. After you process the data, perform the following tasks:

a) Doubly Robust Estimation: Implement the doubly robust estimator on the given dataset. Treat logmedexpense as the outcome \(Y\), healthinsu as the treatment \(A\), and the other variables as the covariates \(X\). Estimate the average treatment effect (ATE) of having health insurance on medical expenses.
b) Does the result in part a) align with your intuition or understanding of having health insurance? Briefly explain your thoughts.
c) Use the bootstrap to estimate the distribution of your doubly robust estimator. For this question, you should use 1000 bootstrap samples. After obtaining the distribution, report the mean and standard deviation of your estimator, provide a histogram of your bootstrap estimates, and then calculate a two-sided 95% confidence interval for your estimator. Based on this result, would you reject the following null hypothesis about the ATE?

\[ \text{H}_0: \tau = 0 \quad \text{vs.} \quad \text{H}_1: \tau \neq 0 \]
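
As a rough illustration of the workflow in this question, the sketch below computes one common form of the doubly robust estimator (augmented inverse probability weighting, AIPW) and a percentile bootstrap interval. It runs on simulated data, not on iv_health.csv. The variable names, the logistic and linear working models, and the function name `dr_ate` are all assumptions made for the example, and the lecture's definition of the estimator takes precedence if it differs.

```r
set.seed(546)

# Simulated stand-in data (NOT the homework data); the true ATE is 1.
n  <- 500
x1 <- rnorm(n)
x2 <- rnorm(n)
a  <- rbinom(n, 1, plogis(0.5 * x1 - 0.5 * x2))
y  <- 1 * a + x1 + 0.5 * x2 + rnorm(n)
sim <- data.frame(y = y, a = a, x1 = x1, x2 = x2)

# AIPW estimate of the ATE for a data frame with columns y, a, x1, x2.
dr_ate <- function(d) {
  # Propensity score model: P(A = 1 | X), assumed logistic here
  ps_fit <- glm(a ~ x1 + x2, data = d, family = binomial())
  e_hat  <- predict(ps_fit, type = "response")
  # Outcome models fitted separately in each treatment arm
  fit1 <- lm(y ~ x1 + x2, data = d[d$a == 1, ])
  fit0 <- lm(y ~ x1 + x2, data = d[d$a == 0, ])
  mu1  <- predict(fit1, newdata = d)
  mu0  <- predict(fit0, newdata = d)
  # Outcome-model contrast plus inverse-probability-weighted residuals
  mean(mu1 - mu0 +
         d$a * (d$y - mu1) / e_hat -
         (1 - d$a) * (d$y - mu0) / (1 - e_hat))
}

tau_hat <- dr_ate(sim)

# Nonparametric bootstrap: resample rows with replacement and re-estimate.
B <- 1000
boot_est <- replicate(B, {
  idx <- sample(nrow(sim), replace = TRUE)
  dr_ate(sim[idx, ])
})

boot_mean <- mean(boot_est)
boot_sd   <- sd(boot_est)
ci        <- quantile(boot_est, c(0.025, 0.975))  # percentile interval

# A histogram could be drawn with, for example:
# hist(boot_est, breaks = 30, main = "Bootstrap distribution", xlab = "ATE")
```

The percentile interval is one of several ways to build a bootstrap confidence interval. Under this approach, the null hypothesis would be rejected at the 5% level when 0 falls outside `ci`.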

Question 2: Optimal Instrument

In our lecture, we provided a statement that

\[ n \text{Var}[\hat{\tau}_\text{iv}^w] = \frac{\text{Var}[\epsilon]\text{Var}[w(Z)]}{[\text{Cov}(A, w(Z))]^2} \]

where \(A\) is the treatment variable, \(Z\) is a vector of instruments, and \(w(Z)\) is some function of \(Z\). We also stated that when \(w(Z)\) takes the optimal form

\[ w(z) = \mathbb{E}[A | Z = z] \]

the variance of the IV estimator simplifies to

\[ n \text{Var}[\hat{\tau}_\text{iv}^w] = \frac{\text{Var}[\epsilon]}{\text{Var}[\mathbb{E}[A | Z]]} \]

Prove this by showing that

\[ \text{Var}[\mathbb{E}[A | Z]] = \text{Cov}[A, \mathbb{E}[A | Z]] \]

Hint: start from the definition of covariance and use the law of total expectation.
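
Before writing the proof, it can help to check the identity numerically. The sketch below is a sanity check on simulated data, not a proof. The four-level instrument and the linear form of the treatment are assumptions chosen only to make \(\mathbb{E}[A | Z]\) easy to estimate by group means.

```r
set.seed(546)
n <- 5000
z <- sample(1:4, n, replace = TRUE)  # a discrete instrument with four levels
a <- 0.5 * z + rnorm(n)              # treatment depends on Z plus noise

# Estimate E[A | Z] by the within-level sample mean of A
# (ave() uses mean as its default summary function).
w <- ave(a, z)

var_w  <- var(w)     # sample analogue of Var[E[A | Z]]
cov_aw <- cov(a, w)  # sample analogue of Cov[A, E[A | Z]]
print(c(var_w = var_w, cov_aw = cov_aw))
```

With group means used as the estimate of the conditional expectation, the two sample quantities agree up to floating-point error, which mirrors the population identity.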

Question 3: Two-Stage Least Squares (2SLS)

In this problem, we will use the card dataset from the wooldridge package.

  library(AER)
  library(wooldridge)
  data("card")

The dataset contains 3010 observations collected in 1966/1976, aiming to estimate the impact of education on wage. For this exercise, we only use the following variables; you can find a description of all variables here.

Description of Variables
Variable	Description
lwage	Annual wage (log form)
educ	Years of education
nearc4	Living close to college (=1) or far from college (=0)
smsa	Living in metropolitan area (=1) or not (=0)
exper	Years of experience
expersq	Years of experience (squared term)
black	Black (=1), not black (=0)
south	Living in the south (=1) or not (=0)

Among the variables provided above, we are interested in finding the effect of education on wage. Thus, lwage should be selected as the response variable \(Y\), and educ is the treatment \(A\). When estimating the relationship between wage and education, scholars have encountered issues in dealing with the so-called “ability bias”: individuals who have higher ability are more likely both to stay longer in school and to get a higher-paying job. Ability, however, is difficult to measure and cannot be included in the model. Thus, we consider “ability” as an unobserved confounder. Suppose nearc4 is proposed as the instrumental variable in this case. Do you think this is a good choice of instrumental variable? Does it satisfy the three requirements of instrumental variables? What could be the potential issues?
Fit a model using linear regression (without using instrumental variable techniques) by treating lwage as the response variable and all other variables as predictors. What is the estimated effect of educ in this case after controlling for the other variables?
Treat nearc4 as the instrumental variable, and fit the model using two-stage least squares (2SLS). Implement the 2SLS method using both your own code and the AER package. Do the results match? What is the estimated effect of educ? How does it compare with the linear regression estimate, and what does this suggest?
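
The mechanics of 2SLS can be sketched on simulated data as below. This is an illustration under stated assumptions, not the solution for the card data. The data-generating process and the variable names (`u` for the unobserved confounder, `z` for the instrument, `x` for an observed covariate) are made up, and the true treatment effect is set to 0.1.

```r
set.seed(546)
n <- 5000
u <- rnorm(n)           # unobserved confounder (plays the role of "ability")
z <- rbinom(n, 1, 0.5)  # binary instrument
x <- rnorm(n)           # observed covariate
a <- 1 + 0.8 * z + 0.5 * x + u + rnorm(n)  # treatment
y <- 2 + 0.1 * a + 0.3 * x + u + rnorm(n)  # outcome; true effect of a is 0.1
sim <- data.frame(y, a, z, x)

# Naive OLS: biased here because u affects both a and y.
ols_coef <- coef(lm(y ~ a + x, data = sim))["a"]

# Stage 1: regress the treatment on the instrument and the covariate.
stage1 <- lm(a ~ z + x, data = sim)
sim$a_hat <- fitted(stage1)

# Stage 2: regress the outcome on the fitted treatment and the covariate.
tsls_coef <- coef(lm(y ~ a_hat + x, data = sim))["a_hat"]

print(c(ols = unname(ols_coef), tsls = unname(tsls_coef)))

# Optional cross-check with AER::ivreg, run only if AER is installed.
# Exogenous covariates appear on both sides of the "|".
if (requireNamespace("AER", quietly = TRUE)) {
  iv_fit <- AER::ivreg(y ~ a + x | z + x, data = sim)
  print(coef(iv_fit)["a"])
}
```

The point estimate from the manual second stage should match the `ivreg` coefficient, but the standard errors reported by the second-stage `lm` are not valid for 2SLS because they are computed from residuals based on the fitted treatment rather than the observed one.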

Assigned: Mar 18, 2023; Due: 11:59 PM CT, Mar 28, 2023