For this problem, we use the iv_health.csv data provided. The dataset aims to study the effects of having health insurance on medical expenses. We are only interested in a particular set of variables, so load the data and restrict the dataset to the variables listed below. In addition, the ssiratio variable should have values between 0 and 1, which is not true for some entries in the dataset. Correct this by capping values greater than 1 at 1.
Variable | Description |
---|---|
logmedexpense | Medical Expenses (log form) |
healthinsu | Having health insurance (=1) or not (=0) |
age | Age |
logincome | Annual Income (log form) |
illnesses | Severity of Illnesses |
ssiratio | Ratio of Social Security Income |
This dataset was originally from the Medical Expenditure Panel Survey and was used as an example here. However, for our purpose, we will assume that we do not have unobserved confounders. After you process the data, perform the following tasks:
Doubly Robust Estimation: Implement the doubly robust estimator on the given dataset. Treat logmedexpense as the outcome \(Y\), healthinsu as the treatment \(A\), and the other variables as the covariates \(X\). Estimate the ATE of having health insurance on medical expenses.
Does the result in part a) align with your intuition or understanding of having health insurance? Provide a brief explanation of your thoughts.
Use the idea of the bootstrap to estimate the distribution of your doubly robust estimator. For this question, you should use 1000 bootstrap samples. After obtaining the distribution, report the mean and standard deviation of your estimator, provide a histogram of your bootstrap samples, and then calculate a two-sided 95% confidence interval for your estimator. Based on this result, would you reject the following null hypothesis about the ATE?
\[ \text{H}_0: \tau = 0 \quad \text{vs.} \quad \text{H}_1: \tau \neq 0 \]
Answer:
Part a)
We load and pre-process the data first.
iv_health = read.csv("../dataset/iv_health.csv")
iv_health$ssiratio[iv_health$ssiratio>1] = 1
head(iv_health)
ATE_Est <- function(a, y, x,
truncps = c(0, 1)) # truncate propensity score if needed (positivity assumption)
{
ate_data = data.frame("a" = a, "y" = y, x)
# fitted propensity score
pscore <- glm(a ~ . - y, data = ate_data, family = "binomial")$fitted.values
pscore <- pmax(truncps[1], pmin(truncps[2], pscore))
# fitted potential outcomes
# zero/one weights restrict each fit to one treatment group, but fitted values are still produced for all subjects
outcome1 <- lm(y ~ . - a, data = ate_data, weights = a)$fitted.values
outcome0 <- lm(y ~ . - a, data = ate_data, weights = (1 - a))$fitted.values
ate_reg <- mean(outcome1 - outcome0)
# doubly robust estimator
res1 <- y - outcome1
res0 <- y - outcome0
r_treat <- mean(a * res1 / pscore) # residual correction term for treated
r_control <- mean((1 - a) * res0 / (1 - pscore)) # residual correction term for control
ate_dr <- ate_reg + r_treat - r_control
return(ate_dr)
}
# subset the dataset
Y = iv_health$logmedexpense
A = iv_health$healthinsu
X = cbind(age = iv_health$age,
          logincome = iv_health$logincome,
          illnesses = iv_health$illnesses,
          ssiratio = iv_health$ssiratio)
# estimate the ATE
ATE_Est(A, Y, X)
## [1] 0.1042567
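As a quick sanity check of the implementation, we can run ATE_Est on simulated data where both the outcome model and the propensity model are correctly specified and the true ATE is known (a sketch; the true effect of 2 below is an arbitrary choice):

```r
# Simulated check: both working models are correct, true ATE = 2
set.seed(7)
n <- 5000
x1 <- rnorm(n)
x2 <- rnorm(n)
ps <- plogis(0.5 * x1 - 0.5 * x2)   # true propensity score
a <- rbinom(n, 1, ps)               # treatment assignment
y <- 2 * a + x1 + x2 + rnorm(n)     # linear outcome model, ATE = 2
ATE_Est(a, y, cbind(x1, x2))        # should be close to 2
```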
Part b)
The estimated Average Treatment Effect (ATE) of having health insurance is 0.10, which is a bit counter-intuitive. One would assume that having health insurance would lower medical costs. However, this may be due to unmeasured confounders in this setting, such as the willingness to seek medical help. For example, patients without health insurance are less likely to seek medical assistance even when ill. Such phenomena may not be captured by the available dataset.
Part c)
We perform 1000 bootstrap samples to estimate the distribution of the doubly robust estimator.
set.seed(1)
nsim = 1000
ate_rec = rep(0, nsim)
for (i in 1:nsim){
ind = sample(1:length(Y), replace = TRUE)
ate_rec[i] = ATE_Est(A[ind], Y[ind], X[ind,])
}
hist(ate_rec, main = "Bootstrap distribution of the DR estimator", xlab = "ATE estimate")
mean(ate_rec)
## [1] 0.103926
sd(ate_rec)
## [1] 0.02668838
# a 95% confidence interval
quantile(ate_rec, c(0.025, 0.975))
## 2.5% 97.5%
## 0.04915587 0.15412082
We would reject the null hypothesis of \(\tau = 0\), as the 95% confidence interval does not contain 0.
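As a cross-check on the percentile interval above (a sketch assuming the objects A, Y, X, and ate_rec are still in the workspace), a normal-approximation interval built from the bootstrap standard error gives a very similar answer:

```r
# Normal-approximation 95% CI: point estimate +/- 1.96 * bootstrap SE
ate_hat <- ATE_Est(A, Y, X)
ate_hat + c(-1, 1) * qnorm(0.975) * sd(ate_rec)
```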
In our lecture, we provided a statement that
\[ n \text{Var}[\hat{\tau}_\text{iv}^w] = \frac{\text{Var}[\epsilon]\,\text{Var}[w(Z)]}{\text{Cov}[A, w(Z)]^2} \]
where \(A\) is the treatment variable, \(Z\) is a multivariate vector of instruments, and \(w(Z)\) is some function of \(Z\). We also stated that when \(w(Z)\) takes the optimal form
\[ w(z) = \mathbb{E}[A | Z = z] \]
the variance of the IV estimator simplifies to
\[ n \text{Var}[\hat{\tau}_\text{iv}^w] = \frac{\text{Var}[\epsilon]}{\text{Var}[\mathbb{E}[A | Z]]} \]
Prove this by showing that
\[ \text{Var}[\mathbb{E}[A | Z]] = \text{Cov}[A, E(A|Z)] \]
Hint: start by the definition of covariance and use the law of total expectation.
Answer:
By the definition of covariance, we have:
\[ \text{Cov}(A, E(A|Z)) = E[(A - E[A])(E(A|Z) - E[E(A|Z)])]. \]
Since \(E[E(A|Z)] = E[A]\) due to the law of total expectation, this simplifies to:
\[ \begin{aligned} & E[(A - E[A])(E(A|Z) - E[A])]\\ =& E[ A \cdot E(A|Z) - A \cdot E[A] - E[A] \cdot E(A|Z) + E[A] \cdot E[A] ]\\ =& E[ A \cdot E(A|Z) ] - E[A]^2\\ =& E[ E(A|Z)^2 ] - E[A]^2 \\ =& E[ E(A|Z)^2 ] - E[E[A|Z]]^2 \\ =& \text{Var}(E(A|Z)) \end{aligned} \]
The third equality uses the law of total expectation once more: since \(E(A|Z)\) is a function of \(Z\), \(E[A \cdot E(A|Z)] = E\big[E\{A \cdot E(A|Z) \mid Z\}\big] = E[E(A|Z)^2]\).
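The identity can also be checked numerically. Below is a quick Monte Carlo sketch in which \(E[A \mid Z]\) is known in closed form by construction (here \(A \mid Z \sim \text{Bernoulli}(\text{plogis}(Z))\)); the two quantities should agree up to simulation error:

```r
set.seed(42)
n <- 1e6
z <- rnorm(n)
p <- plogis(z)          # E[A | Z = z] by construction
a <- rbinom(n, 1, p)    # A | Z ~ Bernoulli(p)
var(p)                  # Var[E(A|Z)]
cov(a, p)               # Cov[A, E(A|Z)] -- should match var(p)
```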
In this problem, we will use the card dataset from the wooldridge package.
library(AER)
library(wooldridge)
data("card")
The dataset contains 3010 observations collected in 1966/1976, aiming to estimate the impact of education on wage. For this exercise, we only use the following variables; you can find a description of all variables here.
Variable | Description |
---|---|
lwage | Annual wage (log form) |
educ | Years of education |
nearc4 | Living close to college (=1) or far from college (=0) |
smsa | Living in metropolitan area (=1) or not (=0) |
exper | Years of experience |
expersq | Years of experience (squared term) |
black | Black (=1), not black (=0) |
south | Living in the south (=1) or not (=0) |
Among the variables provided above, we are interested in finding the effect of education on wage. Thus, lwage should be selected as the response variable \(Y\), and educ is the treatment \(A\). When estimating the relationship between wage and education, scholars have encountered issues in dealing with the so-called “ability bias”: individuals who have higher ability are more likely both to stay longer in school and to get a higher-paying job. Ability, however, is difficult to measure and cannot be included in the model. Thus, we consider “ability” as an unobserved confounder.
nearc4 would be a candidate instrumental variable in this case. Do you think it is a good choice? Does it satisfy the three requirements of instrumental variables? What could be the potential issues?
Fit a model using linear regression (without using instrumental techniques) by treating lwage as the response variable and all other variables as predictors. What is the estimated effect of educ after controlling for the other variables?
Treat nearc4 as the instrumental variable and implement two-stage least squares (2SLS) to fit the model. Implement the 2SLS method using both your own code and the AER package; do the results match? What is the estimated effect of educ? How does it compare with linear regression, and what does this suggest?
Answer:
Part a)
The three requirements are relevance (the instrument is associated with the treatment), exogeneity (the instrument is independent of the unobserved confounder, here ability), and the exclusion restriction (the instrument affects the outcome only through the treatment). Living close to a four-year college plausibly increases educational attainment, satisfying relevance, and geographic proximity is arguably unrelated to innate ability and affects wages only through schooling. Hence, overall, this would be a good choice as an instrumental variable. But there are always potential violations of these assumptions through ways we have not thought of; for instance, families who choose to live near colleges may differ systematically in income or motivation, which would threaten exogeneity.
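Of the three conditions, only relevance can be checked directly from the data; exogeneity and the exclusion restriction must be argued substantively. A minimal relevance check (a sketch, assuming the card data from the wooldridge package is loaded as in part c):

```r
# Relevance check: nearc4 should be a strong predictor of educ
rel <- lm(educ ~ nearc4, data = card)
summary(rel)$coefficients["nearc4", ]
```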
Part b)
To fit a naive linear regression without considering endogeneity, we have
ls.model <- lm(lwage ~ educ + smsa66 + exper + expersq + black + south66, data = card)
summary(ls.model)
##
## Call:
## lm(formula = lwage ~ educ + smsa66 + exper + expersq + black +
## south66, data = card)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.64702 -0.22649 0.02429 0.25341 1.34745
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.7313993 0.0686437 68.927 < 2e-16 ***
## educ 0.0762313 0.0035523 21.459 < 2e-16 ***
## smsa66 0.1127386 0.0150959 7.468 1.06e-13 ***
## exper 0.0852256 0.0067429 12.639 < 2e-16 ***
## expersq -0.0023807 0.0003222 -7.388 1.92e-13 ***
## black -0.1769862 0.0184776 -9.578 < 2e-16 ***
## south66 -0.0960107 0.0160809 -5.970 2.64e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3797 on 3003 degrees of freedom
## Multiple R-squared: 0.2695, Adjusted R-squared: 0.268
## F-statistic: 184.6 on 6 and 3003 DF, p-value: < 2.2e-16
OLS estimation yields an estimate of 7.6% per year for returns to schooling.
Part c)
Incorporating the instrumental variable, we first implement 2SLS manually:
first.stage = lm(educ ~ nearc4 + smsa66 + exper + expersq + black + south66, data = card)
x1_hat <- fitted(first.stage) # save predicted values for education
# note: the manual second stage recovers the 2SLS point estimates, but its
# standard errors are slightly off because its residuals use x1_hat
# instead of the observed educ
second.stage <- lm(lwage ~ x1_hat + smsa66 + exper + expersq + black + south66, data = card)
summary(second.stage)
##
## Call:
## lm(formula = lwage ~ x1_hat + smsa66 + exper + expersq + black +
## south66, data = card)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.50984 -0.26048 0.01981 0.27560 1.46907
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.3572316 0.9261658 3.625 0.000294 ***
## x1_hat 0.1572177 0.0545440 2.882 0.003975 **
## smsa66 0.0809666 0.0267911 3.022 0.002531 **
## exper 0.1183517 0.0234011 5.058 4.50e-07 ***
## expersq -0.0024146 0.0003463 -6.972 3.82e-12 ***
## black -0.1035834 0.0531482 -1.949 0.051394 .
## south66 -0.0637134 0.0277178 -2.299 0.021593 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4072 on 3003 degrees of freedom
## Multiple R-squared: 0.1598, Adjusted R-squared: 0.1581
## F-statistic: 95.16 on 6 and 3003 DF, p-value: < 2.2e-16
ivreg.model <- ivreg(lwage ~ educ + smsa66 + exper + expersq + black + south66 | . - educ + nearc4, data = card)
summary(ivreg.model)
##
## Call:
## ivreg(formula = lwage ~ educ + smsa66 + exper + expersq + black +
## south66 | . - educ + nearc4, data = card)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.96947 -0.25256 0.02019 0.26943 1.48762
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.3572316 0.9353442 3.589 0.000337 ***
## educ 0.1572177 0.0550845 2.854 0.004345 **
## smsa66 0.0809666 0.0270566 2.992 0.002790 **
## exper 0.1183517 0.0236330 5.008 5.82e-07 ***
## expersq -0.0024146 0.0003498 -6.904 6.16e-12 ***
## black -0.1035834 0.0536749 -1.930 0.053722 .
## south66 -0.0637134 0.0279925 -2.276 0.022911 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4112 on 3003 degrees of freedom
## Multiple R-Squared: 0.143, Adjusted R-squared: 0.1413
## Wald test: 93.3 on 6 and 3003 DF, p-value: < 2.2e-16
2SLS estimation yields an estimate of 15.72% per year for returns to schooling. It is significantly higher than the OLS estimate, which suggests that the wage increase caused by education may be understated if we do not account for unobserved confounders.
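As a follow-up (a sketch assuming the second.stage and ivreg.model objects above are still in the workspace), we can confirm the match programmatically and ask AER for its standard instrument diagnostics; note the Sargan test requires more instruments than endogenous regressors, so it is not available here:

```r
# The manual 2SLS coefficient on x1_hat equals ivreg's coefficient on educ
all.equal(unname(coef(second.stage)["x1_hat"]),
          unname(coef(ivreg.model)["educ"]))
# Weak-instrument and Wu-Hausman diagnostics
summary(ivreg.model, diagnostics = TRUE)
```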