Instruction

Please remove this section when submitting your homework.

Students are encouraged to work together on homework and/or utilize advanced AI tools. However, sharing, copying, or providing any part of a homework solution or code to others is an infraction of the University’s rules on Academic Integrity. Any violation will be punished as severely as possible. Final submissions must be uploaded to Gradescope. No email or hard copy will be accepted. For late submission policy and grading rubrics, please refer to the course website.

Question 1: Doubly Robust Estimation and Bootstrap

For this problem, we use the provided iv_health.csv data. The dataset is used to study the effect of having health insurance on medical expenses. We are interested only in a particular set of variables. Load the data and restrict the dataset to the variables listed below. In addition, the ssiratio variable should take values between 0 and 1, which is not true for some entries in the dataset. Correct this by setting values greater than 1 to 1.

Description of Variables

Variable        Description
logmedexpense   Medical expenses (log form)
healthinsu      Having health insurance (=1) or not (=0)
age             Age
logincome       Annual income (log form)
illnesses       Severity of illnesses
ssiratio        Ratio of Social Security income
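
The preprocessing described above can be sketched as follows. This is a minimal illustration on a toy data frame, not the real data: the values and the column extra_column are made up, and the commented-out read.csv() path is an assumption about where the file sits.

```r
# Toy stand-in for iv_health.csv (values are invented for illustration).
# With the real file, one would start from something like:
# health <- read.csv("iv_health.csv")   # file location is an assumption
health <- data.frame(
  logmedexpense = c(6.1, 7.3, 5.8, 6.9),
  healthinsu    = c(1, 0, 1, 0),
  age           = c(66, 71, 80, 68),
  logincome     = c(2.7, 3.1, 2.2, 2.9),
  illnesses     = c(1, 3, 2, 0),
  ssiratio      = c(0.4, 1.3, 1.0, 2.5),
  extra_column  = c("a", "b", "c", "d")   # hypothetical unused column
)

# Keep only the variables of interest
keep <- c("logmedexpense", "healthinsu", "age",
          "logincome", "illnesses", "ssiratio")
health <- health[, keep]

# Cap ssiratio at 1: pmin() takes the element-wise minimum
health$ssiratio <- pmin(health$ssiratio, 1)
```

Using pmin() leaves values already at or below 1 unchanged, so only the out-of-range entries are modified.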

This dataset was originally from the Medical Expenditure Panel Survey and was used as an example here. However, for our purposes, we will assume that there are no unobserved confounders. After you process the data, perform the following tasks:

  1. Doubly Robust Estimation: Implement the doubly robust estimator on the given dataset. Treat logmedexpense as the outcome \(Y\), healthinsu as the treatment \(A\), and other variables as the covariates \(X\). Estimate the ATE of having health insurance on medical expenses.

  2. Does the result in part 1 align with your intuition or understanding of having health insurance? Briefly explain your reasoning.

  3. Use the bootstrap to estimate the distribution of your doubly robust estimator. For this question, you should use 1000 bootstrap samples. After obtaining the distribution, report the mean and standard deviation of your estimator, provide a histogram of the bootstrap estimates, and then calculate a two-sided 95% confidence interval for your estimator. Based on this result, would you reject the null hypothesis in the following test for the ATE?

\[ \text{H}_0: \tau = 0 \quad \text{vs.} \quad \text{H}_1: \tau \neq 0 \]
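
As a rough illustration of the tasks above, the sketch below computes an augmented inverse probability weighting (AIPW) form of the doubly robust estimator and bootstraps it. It rests on several assumptions: the data are simulated (with a true ATE of 0.5) rather than taken from iv_health.csv, the covariate names x1 and x2 and the helper dr_ate() are hypothetical, the nuisance models are a logistic regression and arm-specific linear regressions, and the interval is a percentile interval. Other model choices and interval constructions are equally defensible.

```r
set.seed(1)

# Simulated stand-in data (an assumption, not iv_health.csv); true ATE = 0.5
n  <- 500
x1 <- rnorm(n)
x2 <- rnorm(n)
a  <- rbinom(n, 1, plogis(0.4 * x1 - 0.3 * x2))
y  <- 1 + 0.5 * a + 0.8 * x1 + 0.5 * x2 + rnorm(n)
dat <- data.frame(y, a, x1, x2)

# Hypothetical helper: doubly robust (AIPW) estimate of the ATE
dr_ate <- function(d) {
  # Propensity score model: P(A = 1 | X)
  ps_fit <- glm(a ~ x1 + x2, family = binomial, data = d)
  e_hat  <- predict(ps_fit, type = "response")

  # Outcome models fitted separately in each treatment arm
  fit1 <- lm(y ~ x1 + x2, data = d[d$a == 1, ])
  fit0 <- lm(y ~ x1 + x2, data = d[d$a == 0, ])
  mu1  <- predict(fit1, newdata = d)
  mu0  <- predict(fit0, newdata = d)

  # Regression prediction plus inverse-probability-weighted residual correction
  mean(mu1 - mu0 +
         d$a * (d$y - mu1) / e_hat -
         (1 - d$a) * (d$y - mu0) / (1 - e_hat))
}

tau_hat <- dr_ate(dat)

# Nonparametric bootstrap: resample rows with replacement and re-estimate
B <- 1000
boot_est <- replicate(B, dr_ate(dat[sample(nrow(dat), replace = TRUE), ]))

boot_mean <- mean(boot_est)
boot_sd   <- sd(boot_est)
ci_95     <- quantile(boot_est, c(0.025, 0.975))  # percentile interval

hist(boot_est, breaks = 30,
     main = "Bootstrap distribution of the DR estimator",
     xlab = "Estimated ATE")
```

Under this construction, the two-sided test at the 5% level rejects \(\text{H}_0\) when 0 falls outside ci_95. Both nuisance models are refit inside every bootstrap replicate so that their estimation error is reflected in the bootstrap distribution.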

Question 2: Optimal Instrument

In our lecture, we provided a statement that

\[ n \text{Var}[\hat{\tau}_\text{iv}^w] = \frac{\text{Var}[\epsilon]\text{Var}[w(Z)]}{[\text{Cov}(A, w(Z))]^2} \]

where \(A\) is the treatment variable, \(Z\) is a vector of instruments, and \(w(Z)\) is some function of \(Z\). We also stated that when \(w(Z)\) takes the optimal form

\[ w(z) = \mathbb{E}[A | Z = z] \]

the variance of the IV estimator simplifies to

\[ n \text{Var}[\hat{\tau}_\text{iv}^w] = \frac{\text{Var}[\epsilon]}{\text{Var}[\mathbb{E}[A | Z]]} \]

Prove this by showing that

\[ \text{Var}[\mathbb{E}[A | Z]] = \text{Cov}[A, \mathbb{E}[A | Z]] \]

Hint: start from the definition of covariance and use the law of total expectation.
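
One possible outline of the argument suggested by the hint is given below. It is a sketch, and it assumes that \(A\) has a finite second moment so that every expectation exists. Expanding the covariance and conditioning on \(Z\) in the first term gives

\[
\begin{aligned}
\text{Cov}[A, \mathbb{E}[A | Z]]
&= \mathbb{E}\big[A \, \mathbb{E}[A | Z]\big] - \mathbb{E}[A] \, \mathbb{E}\big[\mathbb{E}[A | Z]\big] \\
&= \mathbb{E}\Big[\mathbb{E}\big[A \, \mathbb{E}[A | Z] \,\big|\, Z\big]\Big] - \big(\mathbb{E}\big[\mathbb{E}[A | Z]\big]\big)^2 \\
&= \mathbb{E}\big[(\mathbb{E}[A | Z])^2\big] - \big(\mathbb{E}\big[\mathbb{E}[A | Z]\big]\big)^2 \\
&= \text{Var}[\mathbb{E}[A | Z]]
\end{aligned}
\]

The second line uses the law of total expectation twice, once to condition the first term on \(Z\) and once to write \(\mathbb{E}[A] = \mathbb{E}[\mathbb{E}[A | Z]]\). The third line pulls \(\mathbb{E}[A | Z]\) out of the inner conditional expectation because it is a function of \(Z\). Substituting \(w(Z) = \mathbb{E}[A | Z]\) into the general variance formula then cancels one factor of \(\text{Var}[\mathbb{E}[A | Z]]\) between the numerator and the squared covariance in the denominator.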

Question 3: Two-Stage Least Squares (2SLS)

In this problem, we will use the card dataset from the wooldridge package.

  library(AER)
  library(wooldridge)
  data("card")

The dataset contains 3010 observations collected in 1966/1976 and is used to estimate the impact of education on wage. For this exercise, we use only the following variables; you can find a description of all variables here.

Description of Variables

Variable   Description
lwage      Annual wage (log form)
educ       Years of education
nearc4     Living close to college (=1) or far from college (=0)
smsa       Living in metropolitan area (=1) or not (=0)
exper      Years of experience
expersq    Years of experience (squared term)
black      Black (=1), not black (=0)
south      Living in the south (=1) or not (=0)

  1. Among the variables provided above, we are interested in the effect of education on wage. Thus, lwage is the response variable \(Y\), and educ is the treatment \(A\). When estimating the relationship between wage and education, scholars have encountered the so-called “ability bias”: individuals with higher ability are more likely both to stay longer in school and to get a higher-paying job. Ability, however, is difficult to measure and cannot be included in the model. Thus, we consider “ability” an unobserved confounder. Suppose we use nearc4 as the instrumental variable. Do you think this is a good choice of instrument? Does it satisfy the three requirements of an instrumental variable? What are the potential issues?

  2. Fit a model using linear regression (without using instrumental variable techniques), treating lwage as the response variable and all other variables as predictors. What is the estimated effect of educ in this case after controlling for the other variables?

  3. Treat nearc4 as the instrumental variable and fit the model using two-stage least squares (2SLS). Implement the 2SLS method using both your own code and the AER package. Do the results match? What is the estimated effect of educ? How does it compare with the linear regression estimate, and what does this suggest?
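
The mechanics of parts 2 and 3 can be sketched on simulated data. Everything below is an assumption made for illustration: the variables y, a, x, z and the unobserved confounder u are hypothetical stand-ins rather than columns of card, and the true effect of a is set to 0.1. The AER::ivreg cross-check runs only if the AER package is installed.

```r
set.seed(2)

# Simulated stand-in data (an assumption, not the card dataset).
# u is an unobserved confounder, z is a valid instrument, true effect of a = 0.1
n <- 5000
u <- rnorm(n)
x <- rnorm(n)
z <- rbinom(n, 1, 0.5)
a <- 12 + 1.0 * z + 0.5 * x + u + rnorm(n)
y <- 1 + 0.1 * a + 0.3 * x + u + rnorm(n)
dat <- data.frame(y, a, x, z)

# Ordinary least squares, ignoring the confounder
ols_coef <- coef(lm(y ~ a + x, data = dat))["a"]

# Stage 1: regress the treatment on the instrument and the covariates
stage1 <- lm(a ~ z + x, data = dat)
dat$a_hat <- fitted(stage1)

# Stage 2: regress the outcome on the fitted treatment and the covariates
stage2 <- lm(y ~ a_hat + x, data = dat)
tsls_coef <- coef(stage2)["a_hat"]

print(c(ols = unname(ols_coef), tsls = unname(tsls_coef)))

# Optional cross-check with AER::ivreg (instruments are listed after the bar)
if (requireNamespace("AER", quietly = TRUE)) {
  iv_fit <- AER::ivreg(y ~ a + x | z + x, data = dat)
  print(c(manual = unname(tsls_coef), ivreg = unname(coef(iv_fit)["a"])))
}
```

The manual second stage reproduces the 2SLS point estimate, but the standard errors reported by its lm() fit are not valid for inference because they are based on residuals computed with a_hat instead of a. In this simulation OLS is biased upward because u raises both a and y, which mirrors the ability-bias story. Whether the same pattern appears in the card data is for the exercise to determine.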