In the previous section, the validity of our methodology relied on a crucial assumption: the unconfoundedness assumption (also called ignorability). This assumption states that the treatment assignment is independent of the potential outcomes given the observed covariates.
Unconfoundedness: \(A_i \perp \{Y_i(0), Y_i(1)\} | X_i\)
In other words, the treatment assignment is as good as random given the observed covariates. Unfortunately, this assumption is often violated in practice because we are not always able to collect all the relevant covariates that affect the treatment assignment. In this section, we introduce the instrumental variable (IV) method, a popular approach to estimating the causal effect in the presence of unobserved confounders.
The IV method can be understood through a graphical representation (Pearl 2009) of the causal relationships among the treatment, the outcome, the observed covariates, and the unobserved confounders. The following graph summarizes the components we have seen before and will see in this lecture. In our previous lectures, we mainly dealt with three components: the observed covariates \(X\), the treatment \(A\), and the outcome \(Y\). Under the unconfoundedness assumption, we can estimate the ATE using the observed covariates \(X\). There is an entire graphical theory, based on Judea Pearl's work, for deciding whether a causal effect is identifiable, but we will not go deeper in that direction. For our purpose, we will consider a new setting with an additional unobserved confounder \(U\) and an instrument \(Z\).
The immediate message of this graph is that, conditioning on \(X\), the treatment \(A\) is no longer independent of the potential outcomes, due to the presence of \(U\). Since we are not able to observe \(U\), we cannot estimate the treatment effect by adjusting for the observed covariates alone.
It is helpful to first assume a regression model and then introduce the IV method. Consider the following linear model without covariates \(X\); all results can be extended to the general case by further conditioning on \(X\).
\[ Y_i = \alpha + \tau A_i + \epsilon_i \]
Keep in mind that because of the presence of the unobserved confounder \(U\), the treatment assignment \(A\) is no longer independent of the potential outcomes, and that dependency shows up through the error term \(\epsilon_i\). This phenomenon is called endogeneity. A simple linear regression would estimate \(\tau\) as
\[ \begin{aligned} \hat{\tau}_\text{ols} &= \frac{\Cov[Y_i, A_i]}{\Var[A_i]} \\ &= \frac{\Cov[\tau A_i + \epsilon_i, A_i]}{\Var[A_i]} \\ &= \tau + \frac{\Cov[\epsilon_i, A_i]}{\Var[A_i]} \\ & \neq \tau \end{aligned} \]
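To see the direction of this bias concretely, consider an illustrative confounding structure (the specific forms below are assumptions made for this example, not part of the model above): suppose \(\epsilon_i = \lambda U_i + e_i\) and \(A_i = \delta U_i + \nu_i\), where \(U_i\), \(e_i\), and \(\nu_i\) are mutually independent. Then
\[ \Cov[\epsilon_i, A_i] = \Cov[\lambda U_i + e_i, \, \delta U_i + \nu_i] = \lambda \delta \Var[U_i], \]
so the OLS bias equals \(\lambda \delta \Var[U_i] / \Var[A_i]\): it is positive whenever the confounder pushes the treatment and the outcome in the same direction, consistent with the upward bias of the OLS estimates in the simulations below.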
Let’s consider the possibility of using an instrument \(Z\) to help us estimate the treatment effect. Our requirement for the instrument is that it should satisfy the following two conditions:
Relevance: \(\Cov[A_i, Z_i] \neq 0\), i.e., the instrument is associated with the treatment.
Exogeneity (exclusion): \(Z_i \perp \epsilon_i\), i.e., the instrument is independent of the unobserved error and can affect the outcome only through the treatment.
Under these conditions, we can represent the relationship as
\[ \begin{aligned} Y_i &= \alpha + \tau A_i + \epsilon_i, \quad \epsilon \perp Z \\ A_i &= \beta + \gamma Z_i + \eta_i, \quad \eta \perp Z \end{aligned} \]
Then this implies that
\[ \begin{aligned} \Cov[Y_i, Z_i] &= \Cov[\tau A_i + \epsilon_i, Z_i] \\ &= \tau \Cov[A_i, Z_i] + \Cov[\epsilon_i, Z_i] \\ &= \tau \Cov[A_i, Z_i], \end{aligned} \]
where the last step uses \(\Cov[\epsilon_i, Z_i] = 0\), which follows from the exogeneity condition \(\epsilon \perp Z\).
The quantity \(\Cov[A_i, Z_i]\) can be obtained from a regression of \(A_i\) on \(Z_i\) (or directly as a sample covariance). We can then estimate the treatment effect as
\[ \hat{\tau}_\text{iv} = \frac{\Cov[Y_i, Z_i]}{\Cov[A_i, Z_i]} \tag{1} \]
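Since dividing both covariances by \(\Var[Z_i]\) leaves the ratio unchanged, this estimator is exactly a ratio of two OLS slopes: the reduced-form slope from regressing \(Y\) on \(Z\), divided by the first-stage slope from regressing \(A\) on \(Z\):
\[ \hat{\tau}_\text{iv} = \frac{\Cov[Y_i, Z_i]/\Var[Z_i]}{\Cov[A_i, Z_i]/\Var[Z_i]}. \]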
The idea of using instrumental variables was first proposed by Philip G. Wright in the paper “The Tariff on Animal and Vegetable Oils” (Wright 1928). In the paper, Wright explained that the price of flaxseed is determined by both supply and demand, but these relationships are interdependent, making it difficult to estimate the separate effects of supply and demand on the price. There was also speculation that the idea was first proposed by his son Sewall Wright; an interesting investigation of this question is provided in Stock and Trebbi (2003).
One may wonder how to find such an instrumental variable. One example is Angrist (1990). In this paper, the research interest is how veteran status (\(A\)) affects earnings (\(Y\)). During the 1960s and 70s, young men in the US were at risk of being drafted for military service in Vietnam. To do this in a fair way, in each year from 1970 to 1972, random sequence numbers were assigned to each birth date. Those with numbers below a cutoff were eligible for the draft, and those above the cutoff were not. The treatment variable is affected (although not completely determined) by the lottery number (\(Z\)). The lottery number is not directly related to earnings, so it satisfies the exclusion and exogeneity assumptions.
Another example is compliance with the treatment. In a clinical trial, the treatment is assigned randomly, but some patients may refuse or be unable to take the assigned treatment. The actual treatment received, \(A_i\), may differ from the assigned treatment label \(Z_i\), but the two are still related. In this case, we can use the assigned treatment label as the instrument, since it can only be related to the outcome through the treatment actually received.
We will perform a simulation study to illustrate the difference between the OLS and IV estimators. We generate a treatment label that depends on both the unobserved confounder \(U\) and the instrument \(Z\). The AER package provides the function ivreg() to run the IV regression. We will also write our own code to estimate the treatment effect using the IV method; the two should be identical. It is easy to see that a direct regression of \(Y\) on \(A\) leads to a biased estimate of the treatment effect.
# record the results
nsim = 500
n = 200
allcoef = matrix(NA, nsim, 3)
library(AER)

for (k in 1:nsim)
{
  # simulate the data
  U = rnorm(n)
  Z = rnorm(n)
  A = rbinom(n, 1, exp(U + Z)/(1 + exp(U + Z)))
  Y = 1 + 2*A + U + rnorm(n)

  # if we run a simple linear regression
  allcoef[k, 1] = summary(lm(Y ~ A))$coef[2]

  # if we use the IV method
  allcoef[k, 2] = summary(ivreg(Y ~ A | Z))$coefficients[2, 1]

  # our own code of the IV method
  allcoef[k, 3] = cov(Y, Z)/cov(A, Z)
}

# summarize the results
result = cbind(apply(allcoef, 2, mean),
               apply(allcoef, 2, sd))
colnames(result) = c("Mean", "SD")
rownames(result) = c("OLS", "AER-ivreg", "IV")
result
## Mean SD
## OLS 2.719402 0.1956494
## AER-ivreg 2.012453 0.5615757
## IV 2.012453 0.5615757
An alternative way of expressing the IV method is through a two-stage approach. For this illustration, we will assume that we observe a one-dimensional instrument \(Z\). Hence we observe \(n\)-vectors \(\bA\), \(\bZ\), and \(\by\) as our data, and they are centered so that \(\sum_i A_i = \sum_i Z_i = \sum_i Y_i = 0\). The IV method can be expressed as
\[ \begin{aligned} \frac{\hat{\Cov}[Y, Z]}{\hat{\Cov}[A, Z]} &= \frac{ \bZ^\T \by }{ \bZ^\T \bA } \\ &= (\bZ^\T \bA)^{-1} \bZ^\T \by \end{aligned} \]
Alternatively, we can interpret the IV method as estimating how much of the variation in \(Y\) is explained by the variation in \(A\) that is due to \(Z\). This can be translated into a two-stage approach. In the first stage, we regress \(A\) on \(Z\) to obtain the fitted values \(\hat{A}\). In the second stage, we regress \(Y\) on \(\hat{A}\) to obtain the estimate of the treatment effect. This is called the two-stage least squares (2SLS) approach. We can quickly validate, in the one-dimensional case, that
\[ \begin{aligned} \hat{\bA} &= \bZ (\bZ^\T \bZ)^{-1} \bZ^\T \bA \\ \hat{\tau}_\text{iv} &= (\hat{\bA}^\T \hat{\bA})^{-1} \hat{\bA}^\T \by \\ &= (\bZ^\T \bA)^{-1} \bZ^\T \by \end{aligned} \]
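As a quick numerical check (a minimal sketch using simulated data in the style of the first simulation above), the ratio form, the matrix form, and the two-stage form all agree once the data are centered:
# generate data as in the earlier simulation
set.seed(1)
n = 200
U = rnorm(n)
Z = rnorm(n)
A = rbinom(n, 1, exp(U + Z)/(1 + exp(U + Z)))
Y = 1 + 2*A + U + rnorm(n)

# center all variables so the intercept-free formulas apply
A = A - mean(A); Z = Z - mean(Z); Y = Y - mean(Y)

# ratio form: Cov(Y, Z)/Cov(A, Z)
cov(Y, Z)/cov(A, Z)

# matrix form: (Z^T A)^{-1} Z^T y
sum(Z*Y)/sum(Z*A)

# two-stage form: fitted values of A on Z, then regress Y on them
A_hat = Z * sum(Z*A)/sum(Z*Z)
sum(A_hat*Y)/sum(A_hat*A_hat)
All three lines print the same estimate.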
In the multi-dimensional case, however, the 2SLS approach is more flexible than the direct ratio formula; in particular, it can handle multiple instruments. In addition, when other measurable covariates (e.g., age, gender, race) also affect \(A\) and \(Y\), the 2SLS approach can easily include them in both stages of the regression. In general, it works in the following way, as illustrated in the graphical view.
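Concretely (a restatement that mirrors the simulation code below, with a one-dimensional covariate \(X_i\) for simplicity), the two stages with covariates are
\[ \begin{aligned} \text{Stage 1:} \quad & A_i = \beta_0 + \gamma Z_i + \beta_1 X_i + \eta_i \;\;\Rightarrow\;\; \text{fitted values } \hat{A}_i \\ \text{Stage 2:} \quad & Y_i = \alpha_0 + \tau \hat{A}_i + \alpha_1 X_i + e_i \;\;\Rightarrow\;\; \hat{\tau}_\text{2sls} \end{aligned} \]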
The following simulation shows the 2SLS approach using the ivmodel package. We also write our own code to estimate the treatment effect using the 2SLS method; the two should be identical.
# record the results
nsim = 500
n = 200
allcoef = matrix(NA, nsim, 3)
library(ivmodel)

for (k in 1:nsim)
{
  # simulate the data
  U = rnorm(n)
  X = rnorm(n)
  Z = X + rnorm(n)
  A = rbinom(n, 1, exp(U + Z + X)/(1 + exp(U + Z + X)))
  Y = 1 + 2*A + U + X + rnorm(n)

  # if we run a simple linear regression
  allcoef[k, 1] = summary(lm(Y ~ A))$coef[2]

  # if we use the IV method from the ivmodel package
  allcoef[k, 2] = ivmodel(Y, A, Z, X, intercept = TRUE)$kClass$point.est[2]

  # our own code of the 2SLS method
  A_hat = lm(A ~ Z + X)$fitted.values
  allcoef[k, 3] = lm(Y ~ X + A_hat)$coef[3]
}

# summarize the results
result = cbind(apply(allcoef, 2, mean),
               apply(allcoef, 2, sd))
colnames(result) = c("Mean", "SD")
rownames(result) = c("OLS", "ivmodel", "IV-TS")
result
## Mean SD
## OLS 3.588072 0.2255799
## ivmodel 1.901717 0.8261776
## IV-TS 1.901717 0.8261776
When we have multiple instruments, any combination of these instruments that still satisfies the instrument conditions can be used to estimate the treatment effect. However, the strength of the instrument is crucial. As we saw previously, the IV estimator has a large variance, since it is a ratio of two estimators, and the variance increases as the instrument becomes weaker. In fact, if we construct some combination of the multivariate instruments \(Z_i\), say \(w(Z_i)\), and use it as the instrument, the variance of the corresponding IV estimator satisfies, asymptotically,
\[ n \Var[\hat{\tau}_\text{iv}^w] \rightarrow \frac{\Var[\epsilon]\Var[w(Z)]}{\Cov[A, w(Z)]^2} \]
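Note first that the scale of \(w(Z)\) does not matter: replacing \(w(Z)\) by \(c\,w(Z)\) for any constant \(c \neq 0\), the factor \(c^2\) appears in both the numerator and the denominator and cancels,
\[ \frac{\Var[\epsilon]\Var[c\,w(Z)]}{\Cov[A, c\,w(Z)]^2} = \frac{c^2 \Var[\epsilon]\Var[w(Z)]}{c^2 \Cov[A, w(Z)]^2} = \frac{\Var[\epsilon]\Var[w(Z)]}{\Cov[A, w(Z)]^2}. \]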
Since rescaling does not change the variance, what matters is how predictive \(w(Z)\) is of \(A\): the more predictive it is, the smaller the variance of the IV estimator. In fact, the optimal solution is given by
\[ w(z) = \E[A | Z = z] \]
We can further verify that this choice simplifies the previous variance expression to
\[ n \Var[\hat{\tau}_\text{iv}^w] \rightarrow \frac{\Var[\epsilon]}{\Var[\E[A | Z]]} \]
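To verify this, note that \(\Var[w(Z)] = \Var[\E[A|Z]]\) by construction, and by iterated expectations
\[ \Cov[A, \E[A|Z]] = \E\left[A \, \E[A|Z]\right] - \E[A]^2 = \E\left[\E[A|Z]^2\right] - \E[A]^2 = \Var[\E[A|Z]], \]
so the general variance expression becomes \(\Var[\epsilon]\Var[\E[A|Z]] / \Var[\E[A|Z]]^2\), which reduces to the display above.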
This again suggests that the IV estimator performs well when \(Z\) is very predictive of \(A\). On the other hand, when the instrument is weak, meaning that \(Z\) does not predict \(A\) well, the IV estimator can perform very poorly, with large variation and even systematic bias (Staiger and Stock 1994; Bound, Jaeger, and Baker 1995).
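The weak instrument phenomenon is easy to see in a small simulation (a sketch in the spirit of the earlier ones; the instrument strengths 1 and 0.1 below are arbitrary illustrative choices):
# compare IV estimators under a strong and a weak instrument
set.seed(1)
nsim = 500
n = 200
est = matrix(NA, nsim, 2)

for (k in 1:nsim)
{
  U = rnorm(n)
  Z = rnorm(n)
  eps = U + rnorm(n)   # error term contains the confounder

  A_strong = U + Z + rnorm(n)       # Z enters A with coefficient 1
  A_weak   = U + 0.1*Z + rnorm(n)   # Z enters A with coefficient 0.1

  Y_strong = 1 + 2*A_strong + eps
  Y_weak   = 1 + 2*A_weak + eps

  est[k, 1] = cov(Y_strong, Z)/cov(A_strong, Z)
  est[k, 2] = cov(Y_weak, Z)/cov(A_weak, Z)
}

# the weak instrument produces a much larger spread around the true value 2
apply(est, 2, sd)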
In many biological studies, we would like to analyze the effect of a certain exposure on a health outcome. For example, we may want to analyze the effect of C-Reactive Protein (CRP) on the risk of coronary heart disease (CHD). In this case, CRP is our treatment variable and CHD is our outcome variable. However, CHD is affected by many other factors, and it is difficult to measure all of them. Using an instrumental variable could help us correct this bias, but where can we find such instruments? One idea is to use a genotype associated with CRP in our DNA (e.g., a SNP, copy number, or methylation) as the instrument. It is generally believed that our DNA is randomized at conception. If we are careful in selecting the instrument, we pick one that is not directly related to CHD but is related to CRP. This allows us to perform Mendelian randomization, which is a special case of the IV method. Lawlor et al. (2008) provide a clear illustration of this idea in analyzing the association between CRP and insulin resistance.
One interesting aspect of this idea is that we may not even need to collect individual patient data for this type of analysis. With the launch of GWAS (Genome-Wide Association Studies), the UK Biobank, and many other smaller-scale studies, we can easily obtain summary statistics for the association between the SNP and CRP and for the association between the SNP and CHD. Hence, by using these summary statistics, we can estimate \(\Cov(Y, Z)\) and \(\Cov(A, Z)\) separately from different studies, and then estimate the treatment effect using the ratio given in (1), which is called the Wald estimator.
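As a sketch of how this computation goes (the numbers below are hypothetical placeholders, not taken from any actual study), suppose one study reports the SNP-exposure slope and its standard error, and another reports the SNP-outcome slope. The Wald ratio, together with a first-order delta-method standard error (treating the two estimates as independent since they come from different studies), is:
# hypothetical summary statistics (illustrative values only)
beta_ZA = 0.15; se_ZA = 0.02   # SNP effect on the exposure (CRP)
beta_ZY = 0.03; se_ZY = 0.01   # SNP effect on the outcome (CHD)

# Wald (ratio) estimator of the causal effect
tau_hat = beta_ZY / beta_ZA

# first-order delta-method standard error of the ratio,
# assuming the two estimates are independent
se_tau = sqrt(se_ZY^2/beta_ZA^2 + beta_ZY^2*se_ZA^2/beta_ZA^4)

c(estimate = tau_hat, se = se_tau)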
There are also many limitations of this approach. For example, the instrument (genetic information) may be associated with the outcome through other pathways that we are not aware of. A common example is pleiotropy, meaning that the same genetic variation affects multiple traits, which may then indirectly affect the outcome we are interested in. Also, the populations involved in different studies may differ: the association between the SNP and CRP may not be the same across populations, and it can be difficult to interpret what the pooled population represents. Weak instruments are also a common issue in this type of analysis. Nonetheless, Mendelian randomization plays an important role in the fields of epidemiology and biostatistics.