\(\newcommand{\ci}{\perp\!\!\!\perp}\) \(\newcommand{\cA}{\mathcal{A}}\) \(\newcommand{\cB}{\mathcal{B}}\) \(\newcommand{\cC}{\mathcal{C}}\) \(\newcommand{\cD}{\mathcal{D}}\) \(\newcommand{\cE}{\mathcal{E}}\) \(\newcommand{\cF}{\mathcal{F}}\) \(\newcommand{\cG}{\mathcal{G}}\) \(\newcommand{\cH}{\mathcal{H}}\) \(\newcommand{\cI}{\mathcal{I}}\) \(\newcommand{\cJ}{\mathcal{J}}\) \(\newcommand{\cK}{\mathcal{K}}\) \(\newcommand{\cL}{\mathcal{L}}\) \(\newcommand{\cM}{\mathcal{M}}\) \(\newcommand{\cN}{\mathcal{N}}\) \(\newcommand{\cO}{\mathcal{O}}\) \(\newcommand{\cP}{\mathcal{P}}\) \(\newcommand{\cQ}{\mathcal{Q}}\) \(\newcommand{\cR}{\mathcal{R}}\) \(\newcommand{\cS}{\mathcal{S}}\) \(\newcommand{\cT}{\mathcal{T}}\) \(\newcommand{\cU}{\mathcal{U}}\) \(\newcommand{\cV}{\mathcal{V}}\) \(\newcommand{\cW}{\mathcal{W}}\) \(\newcommand{\cX}{\mathcal{X}}\) \(\newcommand{\cY}{\mathcal{Y}}\) \(\newcommand{\cZ}{\mathcal{Z}}\) \(\newcommand{\bA}{\mathbf{A}}\) \(\newcommand{\bB}{\mathbf{B}}\) \(\newcommand{\bC}{\mathbf{C}}\) \(\newcommand{\bD}{\mathbf{D}}\) \(\newcommand{\bE}{\mathbf{E}}\) \(\newcommand{\bF}{\mathbf{F}}\) \(\newcommand{\bG}{\mathbf{G}}\) \(\newcommand{\bH}{\mathbf{H}}\) \(\newcommand{\bI}{\mathbf{I}}\) \(\newcommand{\bJ}{\mathbf{J}}\) \(\newcommand{\bK}{\mathbf{K}}\) \(\newcommand{\bL}{\mathbf{L}}\) \(\newcommand{\bM}{\mathbf{M}}\) \(\newcommand{\bN}{\mathbf{N}}\) \(\newcommand{\bO}{\mathbf{O}}\) \(\newcommand{\bP}{\mathbf{P}}\) \(\newcommand{\bQ}{\mathbf{Q}}\) \(\newcommand{\bR}{\mathbf{R}}\) \(\newcommand{\bS}{\mathbf{S}}\) \(\newcommand{\bT}{\mathbf{T}}\) \(\newcommand{\bU}{\mathbf{U}}\) \(\newcommand{\bV}{\mathbf{V}}\) \(\newcommand{\bW}{\mathbf{W}}\) \(\newcommand{\bX}{\mathbf{X}}\) \(\newcommand{\bY}{\mathbf{Y}}\) \(\newcommand{\bZ}{\mathbf{Z}}\) \(\newcommand{\ba}{\mathbf{a}}\) \(\newcommand{\bb}{\mathbf{b}}\) \(\newcommand{\bc}{\mathbf{c}}\) \(\newcommand{\bd}{\mathbf{d}}\) \(\newcommand{\be}{\mathbf{e}}\) \(\newcommand{\bg}{\mathbf{g}}\) \(\newcommand{\bh}{\mathbf{h}}\) \(\newcommand{\bi}{\mathbf{i}}\) \(\newcommand{\bj}{\mathbf{j}}\) \(\newcommand{\bk}{\mathbf{k}}\) \(\newcommand{\bl}{\mathbf{l}}\) \(\newcommand{\bm}{\mathbf{m}}\) \(\newcommand{\bn}{\mathbf{n}}\) \(\newcommand{\bo}{\mathbf{o}}\) \(\newcommand{\bp}{\mathbf{p}}\) \(\newcommand{\bq}{\mathbf{q}}\) \(\newcommand{\br}{\mathbf{r}}\) \(\newcommand{\bs}{\mathbf{s}}\) \(\newcommand{\bt}{\mathbf{t}}\) \(\newcommand{\bu}{\mathbf{u}}\) \(\newcommand{\bv}{\mathbf{v}}\) \(\newcommand{\bw}{\mathbf{w}}\) \(\newcommand{\bx}{\mathbf{x}}\) \(\newcommand{\by}{\mathbf{y}}\) \(\newcommand{\bz}{\mathbf{z}}\) \(\newcommand{\balpha}{\boldsymbol{\alpha}}\) \(\newcommand{\bbeta}{\boldsymbol{\beta}}\) \(\newcommand{\btheta}{\boldsymbol{\theta}}\) \(\newcommand{\bpi}{\boldsymbol{\pi}}\) \(\newcommand{\bxi}{\boldsymbol{\xi}}\) \(\newcommand{\bmu}{\boldsymbol{\mu}}\) \(\newcommand{\bepsilon}{\boldsymbol{\epsilon}}\) \(\newcommand{\bzero}{\mathbf{0}}\) \(\newcommand{\T}{\text{T}}\) \(\newcommand{\Trace}{\text{Trace}}\) \(\newcommand{\Cov}{\text{Cov}}\) \(\newcommand{\Var}{\text{Var}}\) \(\newcommand{\E}{\mathbb{E}}\) \(\newcommand{\pr}{\text{Pr}}\) \(\newcommand{\p}{\text{p}}\) \(\newcommand{\Prob}{\text{P}}\) \(\newcommand{\argmin}{\operatorname*{arg\,min}}\) \(\newcommand{\argmax}{\operatorname*{arg\,max}}\)

Overview

Personalized medicine, also known as precision medicine, is a medical model that tailors treatment to the individual characteristics of each patient. The Human Genome Project, completed in 2003, opened new horizons in understanding individual genetic profiles (Ginsburg and Willard 2009). Broadly, the goal of personalized medicine (Kosorok and Laber 2019) is to move beyond the ‘one-size-fits-all’ approach by accounting for genetic, environmental, and lifestyle factors. Topics in personalized medicine include genomics, pharmacogenomics, and biomarker discovery, which facilitate the development of targeted therapies and individualized treatment plans. This course focuses on data-driven and machine learning approaches to these problems. Applications of such statistical models are not limited to medicine: they are broadly used in many fields, from economics and marketing to cognitive science, environmental science, and sports analytics. Essentially, anytime a decision needs to be made based on individual characteristics, such models can be useful.

Causal Treatment Effects

To estimate the best treatment for an individual, one fundamental concept is the treatment effect. Let’s define some notation. Suppose we have a binary treatment label \(A \in \{0, 1\}\), where 1 indicates one treatment and 0 indicates another treatment (or simply a placebo control). We also have a response variable \(Y\), the outcome of interest. For an individual \(i\), this gives us two potential outcomes, \(Y_i(0)\) and \(Y_i(1)\), and the difference between the two is the causal treatment effect 1:

\[ \Delta_i = Y_i(1) - Y_i(0) \]

The fundamental difficulty in estimating the treatment effect is that we can only observe one of the two outcomes. For example, suppose you received offers from two universities, UIUC and Harvard, and you can only choose one of them. If you choose UIUC and observe your salary after graduation, you will never observe the outcome of having gone to Harvard. The two outcomes essentially live in two parallel universes. This dilemma between the actual (or realized) outcome and the counterfactual outcome is known as the fundamental problem of causal inference (Holland 1986). However, a suitable statistical framework allows us to estimate the treatment effect, or related quantities. The most commonly accepted view is the potential outcome framework, which originated with Neyman (1923), who considered randomized experiments, and Rubin (1974), who considered observational (non-randomized) studies. In this section, let us explore this framework using the randomized control trial and discuss some of its crucial assumptions.

Randomized Control Trial

Let’s consider the average treatment effect (ATE), which is the treatment effect averaged over the entire population:

\[ \tau = \E[\Delta_i] = \E[Y_i(1)] - \E[Y_i(0)] \]

To entertain the idea of potential outcomes, let’s suppose that we could actually visit both parallel universes. Then we would observe both \(Y_i(1)\) and \(Y_i(0)\), and a natural estimator would be

\[ \frac{1}{n} \sum_{i=1}^n \Delta_i = \frac{1}{n} \sum_{i=1}^n \big( Y_i(1) - Y_i(0) \big) \tag{1} \]

This quantity is called the sample average treatment effect (SATE). It is easy to see that it is an unbiased estimator of the ATE. However, this estimator cannot be computed in reality.
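
To make this concrete, below is a minimal R sketch of this hypothetical computation, borrowing the outcome model from the numerical example later in this section; it assumes we can simulate both potential outcomes, which is only possible in a simulation:

  # hypothetical: simulate BOTH potential outcomes (only possible in simulation)
  set.seed(1)
  n <- 200
  X <- rnorm(n)                       # an illustrative covariate
  Y1 <- rnorm(n, mean = 0.5*X + X^2)  # potential outcome under treatment 1
  Y0 <- rnorm(n, mean = 0.5*X)        # potential outcome under treatment 0
  
  # the SATE in (1): average the individual treatment effects
  mean(Y1 - Y0)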

In reality, we could randomly assign the treatment to a group of individuals and compare the outcomes between the two groups. This is known as the randomized control trial (RCT). In this case, we only observe one of the two potential outcomes for each individual:

\[ \begin{aligned} Y_i &= A_i Y_i(1) + (1 - A_i) Y_i(0)\\ &= \begin{cases} Y_i(1) & \text{if } A_i = 1 \\ Y_i(0) & \text{if } A_i = 0 \end{cases} \end{aligned} \]

where \(A_i\) is the treatment assignment for individual \(i\).
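
Continuing the sketch above, in R the observed outcome is simply the potential outcome “unmasked” by the assignment (assuming, for illustration, a balanced randomization):

  # randomly assign half of the subjects to each arm, then reveal
  # only the potential outcome corresponding to the assigned arm
  A <- sample(c(rep(1, n/2), rep(0, n/2)))
  Y <- A * Y1 + (1 - A) * Y0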

The Estimator

A naive (but pretty good) idea is to estimate the ATE using the difference in means between the two groups, which is called the difference-in-means estimator:

\[ \begin{aligned} \widehat\tau &= \frac{1}{n_1} \sum_{A_i = 1} Y_i - \frac{1}{n_0} \sum_{A_i = 0} Y_i \\ &= \frac{1}{n_1} \sum_{i = 1}^n A_i Y_i - \frac{1}{n_0} \sum_{i = 1}^n (1 - A_i) Y_i \end{aligned} \]

where \(n_1\) and \(n_0\) are the sample sizes of the two groups. The advantage of an RCT, as we will see, is that this estimator is unbiased even when other covariates affect the outcome; we will discuss utilizing such covariates later.
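
In code, this estimator is a one-liner (a hypothetical helper, not part of the original notes):

  # difference-in-means estimator; Y: observed outcomes, A: 0/1 treatment labels
  diff_in_means <- function(Y, A) mean(Y[A == 1]) - mean(Y[A == 0])
  diff_in_means(Y, A)  # applied to the simulated data above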

Assumptions

The unbiasedness of this estimator can be established by connecting it with the SATE estimator (1). Besides assuming an i.i.d. (independent and identically distributed) set of samples, we need two important assumptions:

Independence: \(A_i \ci \{Y_i(0), Y_i(1)\}\)

This is a fairly natural assumption: the assignment of treatment has nothing to do with the potential outcomes. What situation would violate this assumption? For example, a patient may choose whichever treatment he or she believes to be better. In that case, the treatment assignment is not independent of the potential outcomes.

SUTVA (Stable Unit Treatment Values Assumption): \(Y_i = Y_i(A_i)\)

This assumption is a bit subtle. It has two components.

  • No interference: the outcome of individual \(i\) is not affected by other individuals.
  • Consistency: there are no hidden forms of treatment, i.e., the observed outcome of individual \(i\) under the assigned treatment \(A_i\) is exactly the potential outcome \(Y_i(A_i)\).

The first part is relatively easy to understand. For example, if you are a patient in a clinical trial, your outcome should not be affected by the treatment of other patients. However, if all patients share and compete for the same medical resources, then the outcome of one patient may be affected by the treatment of others. The second part is somewhat philosophical, and it is closely related to how we define the treatment. One could simply say that it is automatically satisfied. But if a patient does not take the treatment as prescribed, the outcome may differ; there is an entire area of research on noncompliance and intention-to-treat analysis (Gupta 2011).

Unbiasedness

As we discussed previously, we could consider the difference-in-means estimator. In fact, under the assumptions given above, this estimator is unbiased for the SATE, and hence for the ATE. To see this, let’s consider the conditional expectation of its first term given the potential outcomes and the treated sample size \(n_1\) 2:

\[ \begin{aligned} & \, \E \left[ \frac{1}{n_1} \sum_{i = 1}^n A_i Y_i \Biggm| \{Y_i(0), Y_i(1)\}_{i = 1}^n, n_1 \right] \\ =& \, \E \left[ \frac{1}{n_1} \sum_{i = 1}^n A_i Y_i(1) \Biggm| \{Y_i(0), Y_i(1)\}_{i = 1}^n, n_1 \right] \quad \text{by SUTVA} \\ =& \, \frac{1}{n_1} \sum_{i = 1}^n Y_i(1) \, \E \Big[ A_i \Bigm| \{Y_i(0), Y_i(1)\}_{i = 1}^n, n_1 \Big] \quad \text{since } Y_i(1)\text{'s are fixed given the conditioning} \\ =& \, \frac{1}{n_1} \sum_{i = 1}^n Y_i(1) \frac{n_1}{n} \quad \text{by Independence} \\ =& \, \frac{1}{n} \sum_{i = 1}^n Y_i(1) \\ \end{aligned} \]

Here, we used the fact that the treatment assignment is random, i.e., \(\E[A_i \mid n_1] = \Pr(A_i = 1 \mid n_1) = n_1 / n\). The same argument applies to the second term. Therefore, the difference-in-means estimator is unbiased for the SATE, and hence for the ATE.

Asymptotic Inference

We can then analyze the variance of this estimator, given the sample sizes of both arms 3. Since the two sample means are conditionally independent, their variances add:

\[ \Var(\widehat \tau | n_0, n_1) = \frac{1}{n_1} \Var\big[Y_i(1)\big] + \frac{1}{n_0} \Var\big[Y_i(0)\big] \tag{2} \]

By the central limit theorem, \(\widehat \tau\) is asymptotically normal:

\[ \frac{ \widehat \tau - \tau }{ \sqrt{\sigma_1^2/n_1 + \sigma_0^2/n_0} } \xrightarrow{d} \cN \left(0, 1\right) \]

where \(\sigma_1^2 = \Var\big[Y_i(1)\big]\) and \(\sigma_0^2 = \Var\big[Y_i(0)\big]\). Sample version estimates of these two quantities are:

\[ \begin{aligned} \widehat \sigma_1^2 &= \frac{1}{n_1 - 1} \sum_{A_i = 1} \left( Y_i - \bar Y_1\right)^2 \\ \widehat \sigma_0^2 &= \frac{1}{n_0 - 1} \sum_{A_i = 0} \left( Y_i - \bar Y_0\right)^2 \end{aligned} \]

where \(\bar Y_1 = \frac{1}{n_1} \sum_{A_i = 1} Y_i\) and \(\bar Y_0 = \frac{1}{n_0} \sum_{A_i = 0} Y_i\) are the sample means of the two groups, respectively. We may construct a confidence interval for the ATE as:

\[ \widehat \tau \pm z_{1 - \alpha / 2} \sqrt{\widehat V} \]

where \(\widehat V = \widehat \sigma_1^2 / n_1 + \widehat \sigma_0^2 / n_0\) and \(z_{1 - \alpha / 2}\) is the \((1 - \alpha / 2)\)-th quantile of the standard normal distribution. Alternatively, you can use the t-distribution when the sample size is relatively small, but you would need to correct the degrees of freedom and calculate the appropriate standard error.
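
Putting these pieces together, here is a small R helper (a sketch; the function name ate_ci is hypothetical) that computes the point estimate, the plug-in variance \(\widehat V\), and the normal-approximation confidence interval:

  # normal-approximation confidence interval for the ATE from an RCT;
  # Y: observed outcomes, A: 0/1 treatment labels, alpha: significance level
  ate_ci <- function(Y, A, alpha = 0.05) {
    n1 <- sum(A == 1)
    n0 <- sum(A == 0)
    tauhat <- mean(Y[A == 1]) - mean(Y[A == 0])
    Vhat <- var(Y[A == 1]) / n1 + var(Y[A == 0]) / n0
    z <- qnorm(1 - alpha / 2)
    c(estimate = tauhat,
      lower = tauhat - z * sqrt(Vhat),
      upper = tauhat + z * sqrt(Vhat))
  }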

Numerical Example

Let’s use a simulation study to illustrate the properties of the difference-in-means estimator. Suppose we have \(n = 200\) patients, and we randomly assign \(n_1 = 100\) patients to the treatment group and \(n_0 = 100\) patients to the control group. Suppose our outcomes are generated from the model

\[ \E(Y \mid X = x, A = a) = 0.5 \times x + a \times x^2. \tag{3} \]

In this case, the mean potential outcomes for a subject with covariate value \(x\) are \(0.5\times x + x^2\) under treatment 1 and \(0.5\times x\) under treatment 0. If \(X\) follows a standard normal distribution, the average treatment effect is \(\tau = \E(X^2) = 1\). In a regression problem, we would typically observe \(X\), model the relationship between \(X\) and \(Y\), and then infer the treatment effect. In an RCT, however, this is not necessary. Let’s generate the observed outcomes and estimate the ATE using the difference-in-means estimator without using \(X\). We repeat this process 500 times and report the point estimates and their confidence intervals.

  # setting parameters
  n <- 200
  n1 <- 100
  n0 <- 100
  tau <- 1 # true ATE
  nsim = 500
  
  set.seed(1)
  
  tauhat <- rep(NA, nsim)
  tauhatsd <- rep(NA, nsim)
  
  for (i in 1:nsim) {
    # generate potential outcomes
    X <- rnorm(n)
    Y1 <- rnorm(n, mean = 0.5*X+X^2)
    Y0 <- rnorm(n, mean = 0.5*X)
    
    # treatment label
    A = sample(c(rep(1, n1), rep(0, n0)))
    
    # observed outcomes
    Y <- A * Y1 + (1 - A) * Y0
    
    tauhat[i] <- mean(Y[A == 1]) - mean(Y[A == 0])
    tauhatsd[i] <- sqrt(var(Y[A == 1]) / n1 + var(Y[A == 0]) / n0)
  }

  # make two plots on the same row
  par(mfrow = c(1, 2))
  
  # set margin of figure
  par(mar = c(4, 4, 1, 1))
  plot(tauhat[1:50], pch = 19, ylim = c(0, 2),
       xlab = "Simulation Runs", ylab = "Estimated ATE")
  abline(h = tau, col = "red")
  
  # adding confidence intervals
  for (i in 1:50) {
    ci_lower <- tauhat[i] - 1.96 * tauhatsd[i]
    ci_upper <- tauhat[i] + 1.96 * tauhatsd[i]
    arrows(x0 = i, y0 = ci_lower, x1 = i, y1 = ci_upper, angle = 90, code = 3, length = 0.05)
  }
  
  coverage = sum((tauhat - 1.96 * tauhatsd < tau) & (tauhat + 1.96 * tauhatsd > tau)) / nsim
  legend("topright", paste("Coverage probability of", nsim, "runs:", coverage))
  
  boxplot(tauhat, xlab = "Boxplot of Estimated ATE", ylab = "Estimated ATE")
  abline(h = tau, col = "red")

The results roughly demonstrate that the point estimate is unbiased and the confidence interval has the correct coverage probability. Note that this estimation does not use the covariate information; the unbiasedness comes purely from the randomization of the treatment assignment.

Discussion 1: Observational Study

In fact, the difference-in-means estimator is essentially the same as a two-sample test and its confidence interval estimation. Suppose we have two groups of samples, \(X_1, \ldots, X_{n_1}\) and \(Y_1, \ldots, Y_{n_0}\), and we want to test whether the two groups have the same mean. The test statistic is the difference in means, and the asymptotic variance is the same as in (2). Here we use the asymptotic normal distribution rather than a \(t\) distribution.
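
To see this connection numerically, the following sketch (reusing the simulation model from the numerical example) compares the normal-approximation interval with the one reported by R’s built-in Welch two-sample t.test; for large samples the two are nearly identical:

  # simulate a small RCT
  set.seed(1)
  n <- 200
  X <- rnorm(n)
  A <- rbinom(n, 1, 0.5)
  Y <- rnorm(n, mean = 0.5*X + A*X^2)
  
  # normal-approximation interval based on (2)
  tauhat <- mean(Y[A == 1]) - mean(Y[A == 0])
  se <- sqrt(var(Y[A == 1]) / sum(A == 1) + var(Y[A == 0]) / sum(A == 0))
  c(tauhat - 1.96 * se, tauhat + 1.96 * se)
  
  # Welch two-sample t-test interval, nearly the same for large n
  t.test(Y[A == 1], Y[A == 0])$conf.int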

However, there is an important underlying assumption about how the data are generated, which validates the purpose of the test. When we use a two-sample test in general, we assume that the two groups are generated from the same distribution (or at least with the same mean) under the null hypothesis. In estimating the ATE, the random assignment of the treatment label ensures that the two groups are generated from the same distribution whenever there is no treatment effect, so the test statistic behaves the same way. Without an RCT, however, the two groups may be generated from different distributions even when there is no treatment effect, and the difference-in-means estimator is no longer valid. This is essentially the observational study situation, illustrated by the following example.

In a randomized trial design, our analysis shows that, even without utilizing the covariate information, we can still obtain an unbiased estimate of the average treatment effect. This is a very important property of RCTs, and it is the reason why the RCT is considered the gold standard for estimating treatment effects. However, in many cases the treatment cannot be assigned randomly, for practical or ethical reasons. Such a scenario is an observational study, in which the difference-in-means estimator will generally be biased.4 Here is a simulation study. This time, the treatment label is generated by a logistic model with covariate \(X\), which makes subjects with higher \(X\) values more likely to receive treatment 1. Some correction is needed to estimate the treatment effect in observational studies; we will discuss this in the next lecture.

  # setting parameters
  n <- 200
  # note: group sizes are random here since A follows a logistic model
  tau <- 1 # true ATE
  nsim = 500
  
  set.seed(1)
  
  tauhat <- rep(NA, nsim)
  tauhatsd <- rep(NA, nsim)
  
  for (i in 1:nsim) {
    # generate potential outcomes
    X <- rnorm(n)
    Y1 <- rnorm(n, mean = 0.5*X+X^2)
    Y0 <- rnorm(n, mean = 0.5*X)
    
    # treatment label
    A = rbinom(n, 1, exp(X) / (1 + exp(X)))
    
    # observed outcomes
    Y <- A * Y1 + (1 - A) * Y0
    
    tauhat[i] <- mean(Y[A == 1]) - mean(Y[A == 0])
    tauhatsd[i] <- sqrt(var(Y[A == 1]) / sum(A == 1) + var(Y[A == 0]) / sum(A == 0))
  }

  # make two plots on the same row
  par(mfrow = c(1, 2))
  
  # set margin of figure
  par(mar = c(4, 4, 1, 1))
  plot(tauhat[1:50], pch = 19, ylim = c(0, 2),
       xlab = "Simulation Runs", ylab = "Estimated ATE")
  abline(h = tau, col = "red")
  
  # adding confidence intervals
  for (i in 1:50) {
    ci_lower <- tauhat[i] - 1.96 * tauhatsd[i]
    ci_upper <- tauhat[i] + 1.96 * tauhatsd[i]
    arrows(x0 = i, y0 = ci_lower, x1 = i, y1 = ci_upper, angle = 90, code = 3, length = 0.05)
  }
  
  coverage = sum((tauhat - 1.96 * tauhatsd < tau) & (tauhat + 1.96 * tauhatsd > tau)) / nsim
  legend("topright", paste("Coverage probability of", nsim, "runs:", coverage))
  
  boxplot(tauhat, xlab = "Boxplot of Estimated ATE", ylab = "Estimated ATE")
  abline(h = tau, col = "red")

In this lecture, we did not discuss estimating the treatment effect using covariates. Additionally, in many cases we are interested not only in the treatment effect averaged over the entire population, but also in the treatment effect for each individual, which can differ across individuals. This is often referred to as the heterogeneous treatment effect. Estimating heterogeneous treatment effects is a more challenging problem, and it requires utilizing covariate information, along with additional assumptions.

Discussion 2: Regression Based Estimator

The following analysis is adapted from Wager (2020). Let’s consider a case where both potential outcomes are generated from a linear regression:

\[ Y_i(a) = c(a) + X_i \beta_{(a)} + \epsilon_i(a), \quad a = 0, 1 \]

where \(c(a)\) is the intercept, \(\beta_{(a)}\) is the slope coefficient, and \(\epsilon_i(a)\) is the error term satisfying

\[ \E[ \epsilon_i(a) | X_i] = 0, \quad \Var[ \epsilon_i(a) | X_i ] = \sigma^2 \]

We can also assume, without loss of generality, that the randomized trial is balanced with \(\Pr(A_i = 1) = \Pr(A_i = 0) = 0.5\), and that the covariate \(X\) is standardized:

\[ \E[ X ] = 0 \quad \text{and} \quad \Var[ X ] = 1. \]

If we analyze the difference-in-means estimator, we can see that

\[ \begin{aligned} n \Var(\widehat \tau | n_0, n_1) &= 2 \Var[Y_i(1)] + 2 \Var[Y_i(0)] \\ &= 4 \sigma^2 + 2 \Var[X_i \beta_{(1)}] + 2\Var[X_i \beta_{(0)}] \\ &= 4 \sigma^2 + 2 \beta_{(1)}^2 + 2 \beta_{(0)}^2 \\ &= 4 \sigma^2 + ( \beta_{(1)} + \beta_{(0)} )^2 + ( \beta_{(1)} - \beta_{(0)} )^2 \\ \end{aligned} \]

Now suppose we fit separate linear regression models to estimate these parameters: we obtain \(\widehat c(1)\) and \(\widehat \beta_{(1)}\) from the samples that received treatment 1, and \(\widehat c(0)\) and \(\widehat \beta_{(0)}\) from the samples that received treatment 0. The regression-based (OLS) estimator can be written as

\[ \widehat \tau_\text{OLS} = \widehat c(1) - \widehat c(0) + \bar X \left( \widehat \beta_{(1)} - \widehat \beta_{(0)} \right) \]

Since \(\E[X] = 0\), the average treatment effect is \(\tau = c(1) - c(0)\). By analyzing the difference between \(\widehat \tau_\text{OLS}\) and \(\tau\), we have

\[ \begin{aligned} \widehat \tau_\text{OLS} - \tau &= \left( \widehat c(1) - c(1) \right) - \left( \widehat c(0) - c(0) \right) + \bar X \left( \beta_{(1)} - \beta_{(0)} \right) \\ & \quad + \bar X \left( \widehat \beta_{(1)} - \beta_{(1)} \right) - \bar X \left( \widehat \beta_{(0)} - \beta_{(0)} \right) \end{aligned} \tag{4} \]

Keep in mind that for a linear regression model, the variance of the parameter estimates is \(\sigma^2 (Z^\T Z)^{-1}\), where \(Z = \text{cbind}(1, X)\) is the design matrix. In our case, \(X\) has mean zero and is thus asymptotically orthogonal to the intercept column, so \((Z^\T Z)^{-1}\) is approximately a diagonal matrix, with both diagonal entries of order \(1/n_a\) since \(X\) has variance 1.
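
As a quick numerical check of this claim (a sketch under the stated assumptions on \(X\)):

  # with E[X] = 0 and Var[X] = 1, n * (Z'Z)^{-1} is approximately the identity,
  # so each diagonal entry of (Z'Z)^{-1} is of order 1/n
  set.seed(1)
  n <- 100000
  X <- rnorm(n)
  Z <- cbind(1, X)
  n * solve(t(Z) %*% Z)  # approximately the 2 x 2 identity matrix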

Back to (4): the first two terms are asymptotically normal, each with variance roughly \(\sigma^2/n_a = 2\sigma^2/n\), contributing \(4\sigma^2/n\) in total; the third term depends only on the rate at which \(\bar X\) converges to 0, making it asymptotically normal with variance \((\beta_{(1)} - \beta_{(0)})^2/n\); and the last two terms converge to zero at a much faster rate, since each is the product of \(\bar X\) and the small estimation error of a \(\beta\). Hence, overall we have

\[ \sqrt{n} (\widehat \tau_\text{OLS} - \tau ) \xrightarrow{d} \cN \left(0, 4\sigma^2 + (\beta_{(1)} - \beta_{(0)})^2 \right) \]

Comparing with the asymptotic variance of the difference-in-means estimator, we can see that the OLS estimator is more efficient: the \(( \beta_{(1)} + \beta_{(0)} )^2\) term disappears. Let’s use a simulation to compare the two estimators.

  # setting parameters
  n <- 500
  n1 <- n/2
  n0 <- n/2
  tau <- 0.3 # true ATE  
  nsim = 500
  
  set.seed(1)
  
  tauhat_dim <- rep(NA, nsim)
  tauhat_ols <- rep(NA, nsim)
    
  for (i in 1:nsim) {
    # generate potential outcomes
    X <- rnorm(n)
    Y1 <- rnorm(n, mean = 0.5 + tau + X)
    Y0 <- rnorm(n, mean = 0.5 + 2*X) # you can change this coefficient to see the difference
    
    # treatment label
    A = sample(c(rep(1, n1), rep(0, n0)))
    
    # observed outcomes
    Y <- A * Y1 + (1 - A) * Y0
    
    # Difference-in-means estimator
    tauhat_dim[i] <- mean(Y[A == 1]) - mean(Y[A == 0])
    
    # OLS estimator
    X1 <- X[A == 1]
    Y_1 <- Y[A == 1]
    X0 <- X[A == 0]
    Y_0 <- Y[A == 0]
    
    model1 = lm(Y_1 ~ X1)
    model0 = lm(Y_0 ~ X0)
    
    tauhat_ols[i] <- coef(model1)[1] - coef(model0)[1] + mean(X) * (coef(model1)[2] - coef(model0)[2])
  }

  # compare the performance of the two estimators
  mean(tauhat_dim)
## [1] 0.3036613
  mean(tauhat_ols)
## [1] 0.301743
  sd(tauhat_dim)
## [1] 0.1697722
  sd(tauhat_ols)
## [1] 0.09835734
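
These empirical standard deviations agree with the asymptotic formulas above. Under this simulation design \(\sigma^2 = 1\), \(\beta_{(1)} = 1\), and \(\beta_{(0)} = 2\), so a quick check gives:

  # theoretical asymptotic standard deviations under this simulation design
  sigma2 <- 1; b1 <- 1; b0 <- 2; n <- 500
  sqrt((4*sigma2 + (b1 + b0)^2 + (b1 - b0)^2) / n)  # difference-in-means, ~0.167
  sqrt((4*sigma2 + (b1 - b0)^2) / n)                # OLS, ~0.1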

References

Eidelman, Rachel S, Danielle Hollar, Patricia R Hebert, Gervasio A Lamas, and Charles H Hennekens. 2004. “Randomized Trials of Vitamin E in the Treatment and Prevention of Cardiovascular Disease.” Archives of Internal Medicine 164 (14): 1552–56.
Ginsburg, Geoffrey S, and Huntington F Willard. 2009. “Genomic and Personalized Medicine: Foundations and Applications.” Translational Research 154 (6): 277–87.
Gupta, Sandeep K. 2011. “Intention-to-Treat Concept: A Review.” Perspectives in Clinical Research 2 (3): 109.
Holland, Paul W. 1986. “Statistics and Causal Inference.” Journal of the American Statistical Association 81 (396): 945–60.
Kosorok, Michael R, and Eric B Laber. 2019. “Precision Medicine.” Annual Review of Statistics and Its Application 6: 263–86.
Lee, I-Min, Nancy R Cook, J Michael Gaziano, David Gordon, Paul M Ridker, JoAnn E Manson, Charles H Hennekens, and Julie E Buring. 2005. “Vitamin E in the Primary Prevention of Cardiovascular Disease and Cancer: The Women’s Health Study: A Randomized Controlled Trial.” JAMA 294 (1): 56–65.
Neyman, Jerzy. 1923. “On the Application of Probability Theory to Agricultural Experiments. Essay on Principles.” Ann. Agricultural Sciences, 1–51.
Rubin, Donald B. 1974. “Estimating Causal Effects of Treatments in Randomized and Nonrandomized Studies.” Journal of Educational Psychology 66 (5): 688.
Stampfer, Meir J, Charles H Hennekens, JoAnn E Manson, Graham A Colditz, Bernard Rosner, and Walter C Willett. 1993. “Vitamin E Consumption and the Risk of Coronary Disease in Women.” New England Journal of Medicine 328 (20): 1444–49.
Wager, Stefan. 2020. “Stats 361: Causal Inference.” Lecture notes, Stanford University.

  1. In this notation, we usually consider each \(\Delta_i\) as a random variable. They may or may not have the same mean value across different individuals (e.g., if they depend on additional covariates), but that will not change the main result of this section. In our later lectures, we could consider using additional covariates \(X\) to improve our estimation if they are available.↩︎

  2. We condition on the potential outcomes because doing so avoids making assumptions about how they are generated. This is also why we need to assume SUTVA and independence.↩︎

  3. The reason we do not treat \(n_0\) and \(n_1\) as random is that fixed arm sizes are more prevalent in practice. In a clinical trial, we usually do not collect a fixed number of \(n\) patients and then randomize them all at once. For example, some studies recruit pairs of patients and randomize each pair into treatments 1 and 0. Hence, as long as the marginal distribution of \(A_i\) is the same for all patients, the properties we established still hold. Conditioning this way also avoids making assumptions about how the \(A_i\)’s are generated.↩︎

  4. A famous example concerns vitamin E supplements and the risk of coronary heart disease. Studies published in the New England Journal of Medicine in 1993 (Stampfer et al. 1993) reported, based on observational data, that vitamin E supplements can reduce the risk of coronary heart disease. However, later randomized controlled trials (Eidelman et al. 2004; Lee et al. 2005) reported no significant effect.↩︎