For this problem, we use the iv_health.csv data provided. The dataset aims to study the effects of having health insurance on medical expenses. We are only interested in a particular set of variables, so load the data and restrict the dataset to the variables listed below. In addition, the ssiratio variable should have values between 0 and 1, which is not true for some entries in the dataset. Correct this by capping values greater than 1 at 1.
Variable | Description |
---|---|
logmedexpense | Medical Expenses (log form) |
healthinsu | Having health insurance (=1) or not (=0) |
age | Age |
logincome | Annual Income (log form) |
illnesses | Severity of Illnesses |
ssiratio | Ratio of Social Security Income |
This dataset was originally from the Medical Expenditure Panel Survey and was used as an example here. However, for our purpose, we will assume that we do not have unobserved confounders. After you process the data, perform the following tasks:
Doubly Robust Estimation: Implement the doubly robust estimator on the given dataset. Treat logmedexpense as the outcome \(Y\), healthinsu as the treatment \(A\), and the other variables as the covariates \(X\). Estimate the ATE of having health insurance on medical expenses.
Does the result in part a) align with your intuition or understanding of having health insurance? Provide a brief explanation of your thoughts.
Use the idea of the bootstrap to estimate the distribution of your doubly robust estimator. For this question, you should use 1000 bootstrap samples. After obtaining the distribution, report the mean and standard deviation of your estimator, provide a histogram of your bootstrap samples, and then calculate a two-sided 95% confidence interval for your estimator. Based on this result, would you reject the following null hypothesis about the ATE?
\[ \text{H}_0: \tau = 0 \quad \text{vs.} \quad \text{H}_1: \tau \neq 0 \]
Answer:
Part a)
We load and pre-process the data first.
iv_health = read.csv("../dataset/iv_health.csv")
iv_health$ssiratio[iv_health$ssiratio>1] = 1
head(iv_health)
ATE_Est <- function(a, y, x,
truncps = c(0, 1)) # truncate propensity score if needed (positivity assumption)
{
ate_data = data.frame("a" = a, "y" = y, x)
# fitted propensity score
pscore <- glm(a ~ . - y, data = ate_data, family = "binomial")$fitted.values
pscore <- pmax(truncps[1], pmin(truncps[2], pscore))
# fitted potential outcomes
# zero/one weights restrict each fit to one treatment group, but fitted values are still produced for all subjects
outcome1 <- lm(y ~ . - a, data = ate_data, weights = a)$fitted.values
outcome0 <- lm(y ~ . - a, data = ate_data, weights = (1 - a))$fitted.values
ate_reg <- mean(outcome1 - outcome0)
# doubly robust estimator
res1 <- y - outcome1
res0 <- y - outcome0
r_treat <- mean(a * res1 / pscore) # residual correction term for treated
r_control <- mean((1 - a) * res0 / (1 - pscore)) # residual correction term for control
ate_dr <- ate_reg + r_treat - r_control
return(ate_dr)
}
# subset the dataset
Y = iv_health$logmedexpense
A = iv_health$healthinsu
X = cbind(age = iv_health$age,
          logincome = iv_health$logincome,
          illnesses = iv_health$illnesses,
          ssiratio = iv_health$ssiratio)
# estimate the ATE
ATE_Est(A, Y, X)
## [1] 0.1042567
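As a quick sanity check of the implementation, we can run ATE_Est on simulated data where both the outcome model and the propensity model are correctly specified and the true ATE is known (a sketch; the true effect of 2 below is an arbitrary choice):

```r
# Simulated check: both working models are correct, true ATE = 2
set.seed(7)
n <- 5000
x1 <- rnorm(n)
x2 <- rnorm(n)
ps <- plogis(0.5 * x1 - 0.5 * x2)   # true propensity score
a <- rbinom(n, 1, ps)               # treatment assignment
y <- 2 * a + x1 + x2 + rnorm(n)     # linear outcome model, ATE = 2
ATE_Est(a, y, cbind(x1, x2))        # should be close to 2
```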
Part b)
The estimated Average Treatment Effect (ATE) of having health insurance is 0.10, which is a bit counter-intuitive. One would assume that having health insurance would lower medical costs. However, this may be due to unmeasured confounders in this setting, such as the willingness to seek medical help. For example, patients without health insurance are less likely to seek medical assistance even when ill. Such phenomena may not be captured by the available dataset.
Part c)
We perform 1000 bootstrap samples to estimate the distribution of the doubly robust estimator.
set.seed(1)
nsim = 1000
ate_rec = rep(0, nsim)
for (i in 1:nsim){
ind = sample(1:length(Y), replace = TRUE)
ate_rec[i] = ATE_Est(A[ind], Y[ind], X[ind,])
}
hist(ate_rec, main = "Bootstrap distribution of the DR estimator", xlab = "ATE estimate")
mean(ate_rec)
## [1] 0.103926
sd(ate_rec)
## [1] 0.02668838
# a 95% confidence interval
quantile(ate_rec, c(0.025, 0.975))
## 2.5% 97.5%
## 0.04915587 0.15412082
We would reject the null hypothesis of \(\tau = 0\), as the 95% confidence interval does not contain 0.
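As a cross-check on the percentile interval above (a sketch assuming the objects A, Y, X, and ate_rec are still in the workspace), a normal-approximation interval built from the bootstrap standard error gives a very similar answer:

```r
# Normal-approximation 95% CI: point estimate +/- 1.96 * bootstrap SE
ate_hat <- ATE_Est(A, Y, X)
ate_hat + c(-1, 1) * qnorm(0.975) * sd(ate_rec)
```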
In our lecture, we provided a statement that
\[ n \text{Var}[\hat{\tau}_\text{iv}^w] = \frac{\text{Var}[\epsilon]\,\text{Var}[w(Z)]}{\text{Cov}[A, w(Z)]^2} \]
where \(A\) is the treatment variable, \(Z\) is a multivariate vector of instruments, and \(w(Z)\) is some function of \(Z\). We also stated that when \(w(Z)\) takes the optimal form
\[ w(z) = \mathbb{E}[A | Z = z] \]
the variance of the IV estimator simplifies to
\[ n \text{Var}[\hat{\tau}_\text{iv}^w] = \frac{\text{Var}[\epsilon]}{\text{Var}[\mathbb{E}[A | Z]]} \]
Prove this by showing that
\[ \text{Var}[\mathbb{E}[A | Z]] = \text{Cov}[A, E(A|Z)] \]
Hint: start by the definition of covariance and use the law of total expectation.
Answer:
By the definition of covariance, we have:
\[ \text{Cov}(A, E(A|Z)) = E[(A - E[A])(E(A|Z) - E[E(A|Z)])]. \]
Since \(E[E(A|Z)] = E[A]\) due to the law of total expectation, this simplifies to:
\[ \begin{aligned} & E[(A - E[A])(E(A|Z) - E[A])]\\ =& E[ A \cdot E(A|Z) - A \cdot E[A] - E[A] \cdot E(A|Z) + E[A] \cdot E[A] ]\\ =& E[ A \cdot E(A|Z) ] - E[A]^2\\ =& E[ E(A|Z)^2 ] - E[A]^2 \\ =& E[ E(A|Z)^2 ] - E[E[A|Z]]^2 \\ =& \text{Var}(E(A|Z)) \end{aligned} \]
The third equality uses the law of total expectation once more: since \(E(A|Z)\) is a function of \(Z\), \(E[A \cdot E(A|Z)] = E\big[E\{A \cdot E(A|Z) \mid Z\}\big] = E[E(A|Z)^2]\).
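The identity can also be checked numerically. Below is a quick Monte Carlo sketch in which \(E[A \mid Z]\) is known in closed form by construction (here \(A \mid Z \sim \text{Bernoulli}(\text{plogis}(Z))\)); the two quantities should agree up to simulation error:

```r
set.seed(42)
n <- 1e6
z <- rnorm(n)
p <- plogis(z)          # E[A | Z = z] by construction
a <- rbinom(n, 1, p)    # A | Z ~ Bernoulli(p)
var(p)                  # Var[E(A|Z)]
cov(a, p)               # Cov[A, E(A|Z)] -- should match var(p)
```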
In this problem, we will use the card dataset from the wooldridge package.
library(AER)
library(wooldridge)
data("card")
The dataset contains 3010 observations collected in 1966/1976, aiming to estimate the impact of education on wage. For this exercise, we only use the following variables; you can find a description of all variables here.
Variable | Description |
---|---|
lwage | Annual wage (log form) |
educ | Years of education |
nearc4 | Living close to college (=1) or far from college (=0) |
smsa | Living in metropolitan area (=1) or not (=0) |
exper | Years of experience |
expersq | Years of experience (squared term) |
black | Black (=1), not black (=0) |
south | Living in the south (=1) or not (=0) |
Among the variables provided above, we are interested in finding the effect of education on wage. Thus, lwage should be selected as the response variable \(Y\), and educ is the treatment \(A\). When estimating the relationship between wage and education, scholars have encountered issues in dealing with the so-called “ability bias”: individuals who have higher ability are more likely both to stay longer in school and to get a higher-paying job. Ability, however, is difficult to measure and cannot be included in the model. Thus, we consider “ability” as an unobserved confounder.
nearc4 would be a candidate instrumental variable in this case. Do you think it is a good choice? Does it satisfy the three requirements of instrumental variables? What could be the potential issues?
Fit a model using linear regression (without using instrumental techniques) by treating lwage as the response variable and all other variables as predictors. What is the estimated effect of educ after controlling for the other variables?
Treat nearc4 as the instrumental variable and implement two-stage least squares (2SLS) to fit the model. Implement the 2SLS method using both your own code and the AER package; do the results match? What is the estimated effect of educ? How does it compare with linear regression, and what does this suggest?
Answer:
Part a)
The three requirements are relevance (the instrument is associated with the treatment), exogeneity (the instrument is independent of the unobserved confounder, here ability), and the exclusion restriction (the instrument affects the outcome only through the treatment). Living close to a four-year college plausibly increases educational attainment, satisfying relevance, and geographic proximity is arguably unrelated to innate ability and affects wages only through schooling. Hence, overall, this would be a good choice as an instrumental variable. But there are always potential violations of these assumptions through ways we have not thought of; for instance, families who choose to live near colleges may differ systematically in income or motivation, which would threaten exogeneity.
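Of the three conditions, only relevance can be checked directly from the data; exogeneity and the exclusion restriction must be argued substantively. A minimal relevance check (a sketch, assuming the card data from the wooldridge package is loaded as in part c):

```r
# Relevance check: nearc4 should be a strong predictor of educ
rel <- lm(educ ~ nearc4, data = card)
summary(rel)$coefficients["nearc4", ]
```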
Part b)
To fit a naive linear regression without considering endogeneity, we have
ls.model <- lm(lwage ~ educ + smsa66 + exper + expersq + black + south66, data = card)
summary(ls.model)
##
## Call:
## lm(formula = lwage ~ educ + smsa66 + exper + expersq + black +
## south66, data = card)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.64702 -0.22649 0.02429 0.25341 1.34745
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.7313993 0.0686437 68.927 < 2e-16 ***
## educ 0.0762313 0.0035523 21.459 < 2e-16 ***
## smsa66 0.1127386 0.0150959 7.468 1.06e-13 ***
## exper 0.0852256 0.0067429 12.639 < 2e-16 ***
## expersq -0.0023807 0.0003222 -7.388 1.92e-13 ***
## black -0.1769862 0.0184776 -9.578 < 2e-16 ***
## south66 -0.0960107 0.0160809 -5.970 2.64e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3797 on 3003 degrees of freedom
## Multiple R-squared: 0.2695, Adjusted R-squared: 0.268
## F-statistic: 184.6 on 6 and 3003 DF, p-value: < 2.2e-16
OLS estimation yields an estimate of 7.6% per year for returns to schooling.
Part c)
Incorporating the instrumental variable, we first implement 2SLS manually:
first.stage = lm(educ ~ nearc4 + smsa66 + exper + expersq + black + south66, data = card)
x1_hat <- fitted(first.stage) # save predicted values for education
# note: the manual second stage recovers the 2SLS point estimates, but its
# standard errors are slightly off because its residuals use x1_hat
# instead of the observed educ
second.stage <- lm(lwage ~ x1_hat + smsa66 + exper + expersq + black + south66, data = card)
summary(second.stage)
##
## Call:
## lm(formula = lwage ~ x1_hat + smsa66 + exper + expersq + black +
## south66, data = card)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.50984 -0.26048 0.01981 0.27560 1.46907
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.3572316 0.9261658 3.625 0.000294 ***
## x1_hat 0.1572177 0.0545440 2.882 0.003975 **
## smsa66 0.0809666 0.0267911 3.022 0.002531 **
## exper 0.1183517 0.0234011 5.058 4.50e-07 ***
## expersq -0.0024146 0.0003463 -6.972 3.82e-12 ***
## black -0.1035834 0.0531482 -1.949 0.051394 .
## south66 -0.0637134 0.0277178 -2.299 0.021593 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4072 on 3003 degrees of freedom
## Multiple R-squared: 0.1598, Adjusted R-squared: 0.1581
## F-statistic: 95.16 on 6 and 3003 DF, p-value: < 2.2e-16
ivreg.model <- ivreg(lwage ~ educ + smsa66 + exper + expersq + black + south66 | . - educ + nearc4, data = card)
summary(ivreg.model)
##
## Call:
## ivreg(formula = lwage ~ educ + smsa66 + exper + expersq + black +
## south66 | . - educ + nearc4, data = card)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.96947 -0.25256 0.02019 0.26943 1.48762
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.3572316 0.9353442 3.589 0.000337 ***
## educ 0.1572177 0.0550845 2.854 0.004345 **
## smsa66 0.0809666 0.0270566 2.992 0.002790 **
## exper 0.1183517 0.0236330 5.008 5.82e-07 ***
## expersq -0.0024146 0.0003498 -6.904 6.16e-12 ***
## black -0.1035834 0.0536749 -1.930 0.053722 .
## south66 -0.0637134 0.0279925 -2.276 0.022911 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4112 on 3003 degrees of freedom
## Multiple R-Squared: 0.143, Adjusted R-squared: 0.1413
## Wald test: 93.3 on 6 and 3003 DF, p-value: < 2.2e-16
2SLS estimation yields an estimate of 15.72% per year for returns to schooling. It is significantly higher than the OLS estimate, which suggests that the wage increase caused by education may be understated if we do not account for unobserved confounders.
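As a follow-up (a sketch assuming the second.stage and ivreg.model objects above are still in the workspace), we can confirm the match programmatically and ask AER for its standard instrument diagnostics; note the Sargan test requires more instruments than endogenous regressors, so it is not available here:

```r
# The manual 2SLS coefficient on x1_hat equals ivreg's coefficient on educ
all.equal(unname(coef(second.stage)["x1_hat"]),
          unname(coef(ivreg.model)["educ"]))
# Weak-instrument and Wu-Hausman diagnostics
summary(ivreg.model, diagnostics = TRUE)
```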