A simulation study of hypothesis testing

Suppose we have the following scenario:

  • A teacher is interested in knowing if the average final exam score of his class is above 85.
  • The class has 48 students, and the individual scores follow independent normal distributions with standard deviation 10.
  • Question:
    • What are the corresponding Null and Alternative hypotheses?
    • What evidence would be considered as “extreme” evidence against the Null?
    • Use an \(\alpha\) value of 0.05, and use simulated data to determine the cut-off value for rejecting the Null
  • Suppose that after the final exam, we observe that the average score of this class is 87. Should we reject or accept the Null?

First, since we are testing whether the mean score of the class of 48 students is greater than 85, our Null and alternative hypotheses are

\[\text{Null } H_0: \text{mean score = 85} \quad \text{vs.} \quad \text{Alternative } H_1: \text{mean score > 85}\] If we have very strong evidence against the Null, we could reject it. Now, we want to determine what should be considered “extreme” evidence against the Null, if the Null is true. Consider the following idea:

  • Generate many “averages of 48 students” (based on the Null assumption) and obtain the distribution of these means
  • Determine a cut-off value based on this distribution by looking at the extreme end, using a pre-defined \(\alpha\) value
  • Look at the actual observed mean value (87) and compare it to the cut-off value
  • Accept or reject the Null hypothesis

The following code implements this idea:

  # Generate 1000 such "classes" and obtain the distribution
  library(ggplot2)
  set.seed(1)
  
  allaverage = rep(NA, 1000)
  for (i in 1:1000)
    allaverage[i] = mean(rnorm(48, mean = 85, sd = 10))
  ggplot(data.frame(allaverage), aes(x=allaverage)) + 
    geom_histogram(aes(y=..density..), colour="black", fill="white") +
    geom_density(alpha=.2, fill="#FF6666") +
    labs(x = "Simulated Average Scores") + 
    geom_vline(xintercept = 87, color = "blue", size = 1.2)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

  # Now, a surprising result should fall into the top 5% of that distribution
  quantile(allaverage, 0.95)
##     95% 
## 87.3478
  mean(allaverage > 87)
## [1] 0.073

The distribution (of averages) we are trying to obtain from this simulation is called the “Null distribution”. If our assumption (the Null) is true, then about 7.3% of those mean scores would be greater than 87. At our 0.05 significance level, this is not extreme evidence. Hence, we do not reject the Null hypothesis.

In fact, we can skip the simulation process entirely and know the Null distribution exactly, thanks to statistical theory. For all the testing procedures in this lecture, the Null distribution is known exactly, thanks to many early pioneers, such as Sir Ronald Fisher, Karl Pearson, and William Gosset.

In short, if each of the 48 students’ scores independently follows a normal distribution with mean 85 and standard deviation (SD) 10, then the distribution of their average is normal with mean 85 and standard deviation \(10/\sqrt{48}\). We can compare this theoretical distribution with the simulated data obtained previously. They are in fact very similar.

  # the simulated data
  p <- ggplot(data.frame(allaverage), aes(x=allaverage)) + 
        geom_histogram(aes(y=..density..), colour="black", fill="white") +
        geom_density(alpha=.2, fill="#FF6666") +
        labs(x = "Simulated Average Scores") + 
        geom_vline(xintercept = 87, color = "blue", size = 1.2)

  # add the normal distribution density
  x = seq(80, 90, length.out = 100)
  df = data.frame(x, "y" = dnorm(x, mean = 85, sd = 10/sqrt(48)))
  
  p + geom_line(data = df, aes(x = x, y = y), color = "red", size = 2)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

We can also obtain the theoretical cut-off point from this distribution, given any \(\alpha\) value (see the check after the output below). The entire testing process can be condensed into the following code, which computes the \(p\)-value.

  # calculate p-value of the z-test
  pnorm(87 - 85, sd = 10/sqrt(48), lower.tail = FALSE)
## [1] 0.08292833
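
As a quick check (not part of the original output), the theoretical cut-off at \(\alpha = 0.05\) can be obtained with qnorm() on the same Null distribution; it should land close to the simulated 95% quantile of about 87.35 seen earlier.

  # theoretical cut-off: reject the Null if the observed mean exceeds this value
  qnorm(0.95, mean = 85, sd = 10/sqrt(48))  # approximately 87.37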

Interpretation of \(p\)-values

The most common misunderstanding of the \(p\)-value is that

  • it is the probability of the Null hypothesis being true.

Instead, it is

  • the probability, assuming the Null hypothesis is true, of observing evidence at least as extreme as the actually observed data.

It should be noted that a hypothesis testing procedure and its conclusion depend heavily on the underlying assumptions. When these assumptions are violated, we may end up with a higher (or sometimes lower) chance of rejecting the Null even when the Null is true, which leads to a higher (or lower) chance of type I error.

Type I and II errors

Hence, we can expect four possible outcomes:

                     Accept Null                          Reject Null
  Null True          Confidence Level (\(1 - \alpha\))    Type I Error (\(\alpha\))
  Alternative True   Type II Error                        Power

Based on our preliminary analysis, we would decide to reject the Null hypothesis if the observed mean is larger than 87.35 (the simulated cut-off). However, keep in mind that at this point we have not observed the actual data yet, and we do not know whether the data comes from the Null or the Alternative. Previously we only discussed the case where the Null is the data generator, but what if the Alternative is?

  • In this case, we do want to reject the Null.
  • What is the chance that we reject the Null if the true population mean is 86? This is the power (see the simulation sketch below).
  • Sample size calculation is also related to this issue: if the true population mean is 86 and we want 90% power, how many samples do we need? (See the sample size and CLT discussion later.)
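
As a rough illustration (not part of the original notes), the power at a true mean of 86 can be estimated by simulation: generate many classes under the Alternative and count how often their average exceeds the cut-off.

  # a minimal sketch: estimate power by simulation when the true mean is 86
  set.seed(1)
  cutoff = qnorm(0.95, mean = 85, sd = 10/sqrt(48))  # cut-off value, about 87.37
  altaverage = replicate(1000, mean(rnorm(48, mean = 86, sd = 10)))
  mean(altaverage > cutoff)  # proportion of rejections; theoretically about 0.17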

z-test

Formally, the above test is called the z-test. Besides the statistical mechanism we have already discussed, one crucial aspect of the z-test is the assumption that the standard deviation (\(\sigma\)) of a score is already known. This is usually not the case in practice. Hence, the z-test is rarely used in practice unless we have strong prior information. Nonetheless, if we want to obtain the theoretical p-value of such a test, we can use the normal distribution functions to help us.

Example 1: Suppose we are interested in testing whether the average score is larger than 80. Assume that the true ability of each student follows a normal distribution with SD = 15, and the observed average test score is 90. Can you obtain the \(p\)-value of this test? With lower.tail = FALSE, the pnorm() function calculates the probability that a normal random variable (with the given mean and sd) is greater than a certain value.

  # 90-80 is the value
  # 15/sqrt(48) is sd
  # mean is 0 by default
  # lower.tail=FALSE means that we are looking at the upper side of the distribution
  pnorm(90 - 80, sd = 15/sqrt(48), lower.tail = FALSE)

Testing Mean with Unknown Variance: \(t\)-test

As we explained before, oftentimes we need to estimate the standard deviation \(\sigma\) or the variance \(\sigma^2\). This is done by calculating the (unbiased) sample variance

\[ \hat\sigma^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar x)^2\] The standard deviation estimate is simply the square root of \(\hat\sigma^2\).
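
As a small sanity check (not part of the original notes), the formula above matches R’s built-in var() function:

  # verify the unbiased sample variance formula against var() and sd()
  set.seed(1)
  x = rnorm(20)
  n = length(x)
  sum((x - mean(x))^2) / (n - 1)  # sample variance by hand
  var(x)                          # same value from the built-in function
  sd(x)                           # the SD estimate is its square root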

We will introduce several \(t\)-tests that utilize this quantity and the \(t\)-distribution.

  • One sample \(t\)-test
  • Two sample \(t\)-test

The essential idea of these tests is the same as the \(z\)-test: figure out the Null distribution, and look at where the observed value lies in that distribution. If it is at the extreme end, with less than \(\alpha\) chance under the Null, then we reject the Null.

One sample \(t\)-test

A motivating example. The Cleveland Clinic heart disease dataset aims at modeling the diagnosis of coronary heart disease. Let’s look at a simple task involving the age variable: we want to know if the average age of patients who have been identified with coronary heart disease is larger than 50. We observe 303 individuals in this dataset.

  heart = read.csv("processed_cleveland.csv")
  head(heart)  # preview the first few rows of the data
  hist(heart$age)

The difficulty lies in figuring out a way to incorporate the estimated SD instead of using the known truth. Researchers realized that the following quantity follows a \(t\) distribution with \(n-1\) degrees of freedom, if the Null hypothesis is true:

\[\frac{\bar{x} - \mu}{\hat\sigma / \sqrt{n}}.\] Here, \(\bar x\) is the sample mean we actually observed (such as 87 in the \(z\)-test example), \(\mu\) is the Null hypothesis value (such as 85), and \(\hat\sigma\) is the estimated standard deviation (replacing the known value 10). This is called the Student’s \(t\)-test. There is already an existing function that performs this test. The following code tests the hypothesis

\[H_0: \text{mean age = 50} \quad \text{vs.} \quad H_1: \text{mean age > 50}\]

  t.test(heart$age, mu = 50, alternative = "greater")
## 
##  One Sample t-test
## 
## data:  heart$age
## t = 8.5486, df = 302, p-value = 3.151e-16
## alternative hypothesis: true mean is greater than 50
## 95 percent confidence interval:
##  53.58221      Inf
## sample estimates:
## mean of x 
##  54.43894

Let’s look at some of the outputs in this function:

  • \(t\) = 8.5486 is exactly the quantity calculated from the above formula (verified in the sketch below)
  • df = 302 is the degrees-of-freedom, which is sample size 303 minus 1.
  • \(p\)-value (this you should be able to interpret yourself)
  • 95 percent confidence interval. This is an interval reverse-calculated from the formula to make inference about the true mean. We will explain this later.
  • mean of \(x\) is just the sample mean of age.
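
As a quick check (not part of the original output), the \(t\) statistic can be computed by hand from the formula:

  # compute the t statistic directly; this should match t = 8.5486
  n = length(heart$age)
  (mean(heart$age) - 50) / (sd(heart$age) / sqrt(n))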

Remark on the \(t\) distribution: A \(t\) distribution with 1 degree of freedom has much heavier tails than the normal distribution, hence is more “spread out”. You can notice from the plot below that as the degrees of freedom increase, the distribution becomes closer and closer to the standard normal distribution.

  x = seq(-5, 5, 0.01)
  den = data.frame("x" = x, "y" = dnorm(x), "dist" = "norm")
  den = rbind(den, data.frame("x" = x, "y" = dt(x, df = 1), "dist" = "t(1)"))
  den = rbind(den, data.frame("x" = x, "y" = dt(x, df = 3), "dist" = "t(3)"))
  den = rbind(den, data.frame("x" = x, "y" = dt(x, df = 6), "dist" = "t(6)"))
  
  ggplot(den, aes(x=x, y=y, group=dist)) +
    geom_line(aes(linetype=dist, color=dist), size = 1.5) +
    scale_linetype_manual(values=c("solid", "solid", "twodash", "dotted")) +
    scale_color_manual(values=c("darkorange", "deepskyblue", "deepskyblue", "deepskyblue")) + 
    theme(legend.key.size = unit(2, 'cm'))

Remark on the confidence interval: An interesting phenomenon about the sample mean \(\bar x\) is that it concentrates around the true population mean, and as the sample size increases, this concentration becomes tighter. Moreover, the variation of \(\bar x\) around the true mean is approximately normal, with magnitude quantified by \(\hat\sigma / \sqrt{n}\); this is the Central Limit Theorem, a foundation of almost all statistical research. In our context, since the sample mean eventually varies closely around the true population mean, the true population mean (unknown) should also be somewhere near the observed sample mean. Based on this fact, we can construct an interval that is wide enough to cover 95% of the uncertainty and draw inference about \(\mu\). The details are beyond this course.
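
As a sanity check (not part of the original output), the lower end of the one-sided 95% interval reported by t.test() can be reproduced by hand:

  # reverse-calculate the one-sided 95% lower bound; should match 53.58221
  n = length(heart$age)
  se = sd(heart$age) / sqrt(n)                 # estimated standard error
  mean(heart$age) - qt(0.95, df = n - 1) * se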

Two sample \(t\)-test

Question: is the average age different across gender?

For this question, we can set up a new hypothesis testing problem:

\[H_0: \mu_1 = \mu_2 \quad \text{vs.} \quad H_1: \mu_1 \neq \mu_2,\] where \(\mu_1\) and \(\mu_2\) represent the two population means.

The idea is still the same: we construct some test statistic (computable from any observed samples), and this quantity follows a known distribution if the Null hypothesis is true. Hence, after observing the actual samples, we evaluate whether the observed value lies at the extreme end of the known Null distribution. It turns out that we can use the following statistic, if we assume that the variances of the two samples are the same:

\[t = \frac{\bar x_1 - \bar x_2}{\hat\sigma_p \sqrt{1/n_1 + 1/n_2}}\] Here, \(\bar x_1\) and \(\bar x_2\) are the sample averages, \(n_1\) and \(n_2\) are the corresponding sample sizes, and \(\hat\sigma_p\) is a pooled standard deviation. Note that this is NOT obtained by pooling all samples together and then calculating the standard deviation; rather, the pooled variance is \(\hat\sigma_p^2 = \frac{(n_1 - 1)\hat\sigma_1^2 + (n_2 - 1)\hat\sigma_2^2}{n_1 + n_2 - 2}\), where \(\hat\sigma_1^2\) and \(\hat\sigma_2^2\) are the two sample variances (see the hand computation after the output below).

  male_age = heart$age[heart$sex == 1]
  female_age = heart$age[heart$sex == 0]
  t.test(male_age, female_age, alternative = "two.sided", var.equal = TRUE)
## 
##  Two Sample t-test
## 
## data:  male_age and female_age
## t = -1.7004, df = 301, p-value = 0.09009
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -4.0701726  0.2967765
## sample estimates:
## mean of x mean of y 
##  53.83495  55.72165
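
As a quick check (not part of the original output), the pooled variance formula above reproduces the reported \(t\) statistic:

  # compute the equal-variance t statistic by hand; should match t = -1.7004
  n1 = length(male_age); n2 = length(female_age)
  sp2 = ((n1 - 1) * var(male_age) + (n2 - 1) * var(female_age)) / (n1 + n2 - 2)
  (mean(male_age) - mean(female_age)) / sqrt(sp2 * (1/n1 + 1/n2))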

Of course, you can imagine that if we do not assume equal variances, there will be another version of the test (the Welch two-sample \(t\)-test). The two versions can be quite different if the two populations display very different variances. Moreover, the degrees of freedom associated with this test may not be an integer.

  t.test(male_age, female_age, alternative = "two.sided", var.equal = FALSE)
## 
##  Welch Two Sample t-test
## 
## data:  male_age and female_age
## t = -1.6648, df = 178.57, p-value = 0.0977
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -4.1230253  0.3496293
## sample estimates:
## mean of x mean of y 
##  53.83495  55.72165

Remark on the paired \(t\)-test: this is a special case where each subject is observed twice, and we want to know if the first measurement differs from the second measurement on average within this population. This design is very popular in clinical trials. A problem with ignoring the pairing is that the estimated pooled variance can be very large (it absorbs the between-subject variation), which leads to insignificance. The strategy is to take the difference between the two measurements from the same subject, and treat these differences as a one-sample \(t\)-test with 0 as the Null hypothesis value. The following code demonstrates this effect.

  x1 = rnorm(100)
  x2 = x1 + 0.2 + rnorm(100, sd = 0.1)
  
  # without pairing
  t.test(x1, x2, alternative = "two.sided")
## 
##  Welch Two Sample t-test
## 
## data:  x1 and x2
## t = -1.5518, df = 197.9, p-value = 0.1223
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.46410388  0.05534425
## sample estimates:
##   mean of x   mean of y 
## -0.10764767  0.09673214
  
  # with pairing
  t.test(x1, x2, alternative = "two.sided", paired = TRUE)
## 
##  Paired t-test
## 
## data:  x1 and x2
## t = -19.126, df = 99, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.2255836 -0.1831760
## sample estimates:
## mean of the differences 
##              -0.2043798
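
Equivalently (a quick check, not part of the original output), a one-sample \(t\)-test on the within-subject differences reproduces the paired result:

  # the paired t-test is the same as a one-sample t-test on the differences
  t.test(x1 - x2, mu = 0, alternative = "two.sided")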

Tests for contingency tables

Motivating question: is fasting blood sugar (fbs) prevalence different across gender?

The challenge here is that both variables are categorical. We can first use a contingency table to summarize the data:

  heart$sex = factor(heart$sex)
  levels(heart$sex) = c("female", "male")
  heart$fbs = factor(heart$fbs)
  levels(heart$fbs) = c("low", "high")
  table(heart$sex, heart$fbs)
##         
##          low high
##   female  85   12
##   male   173   33

One way to test whether fbs is the same across gender is the Chi-squared (\(\chi^2\)) test of independence: if the two variables are independent, then the prevalence would be the same regardless of gender. The test is called the Chi-squared (\(\chi^2\)) test because its test statistic (skipped here) approximately follows a Chi-squared distribution when the sample size is large enough. The following code can be used:

  chisq.test(heart$sex, heart$fbs)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  heart$sex and heart$fbs
## X-squared = 0.43559, df = 1, p-value = 0.5093
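
To see what the statistic compares (a small addition, not part of the original output), the expected cell counts under independence can be extracted from the test object:

  # expected counts under independence; the statistic measures how far the
  # observed table deviates from these
  chisq.test(heart$sex, heart$fbs)$expected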

An alternative way of doing this test is Fisher’s exact test. This is the test used in the Lady tasting tea problem, for which Fisher originally proposed the first hypothesis testing procedure. However, the test can be computationally inefficient because it requires calculating very large combinatorial numbers when the sample size is large. Moreover, in many situations with a sufficient sample size, it gives results similar to the Chi-squared (\(\chi^2\)) test.

  fisher.test(heart$sex, heart$fbs)
## 
##  Fisher's Exact Test for Count Data
## 
## data:  heart$sex and heart$fbs
## p-value = 0.4897
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  0.6400317 3.0214601
## sample estimates:
## odds ratio 
##   1.349867

Multiple testing problems

In all of the above examples, we assumed that we observe the samples only once. Now suppose there are 100 researchers, and each of them observes their own samples.

  • Using the same testing procedure, will they all arrive at the same conclusion?
  • On average, how many of them would end up with a wrong conclusion? (See the simulation sketch below.)
  • What conditions could affect that result?
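
A minimal simulation sketch (not part of the original notes): let 100 researchers each test a true Null (mean = 85) at \(\alpha = 0.05\) with their own samples, and count the false rejections.

  # each researcher draws their own class of 48 and runs a one-sided t-test
  set.seed(1)
  pvals = replicate(100, t.test(rnorm(48, mean = 85, sd = 10),
                                mu = 85, alternative = "greater")$p.value)
  sum(pvals < 0.05)  # number of false rejections; about 5 on average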

In genetic research, such as genome-wide association studies, multiple testing and false discoveries are important issues.