Suppose we have the following scenario: a class of 48 students took an exam, the observed class average is 87, and we assume each student's score follows a normal distribution with standard deviation 10. We want to test, at the 0.05 significance level, whether the true mean score is greater than 85.
First, since we are testing whether the mean score of the class of 48 students is greater than 85, our Null and alternative hypotheses are
\[\text{Null } H_0: \text{mean score = 85} \quad \text{vs.} \quad \text{Alternative } H_1: \text{mean score > 85}\] If we have very strong evidence against the Null, we could reject it. Now, we want to see what should be considered as “extreme” evidence against the Null, if the Null is true. Consider the following idea:
# Generate 1000 such "classes" and obtain the distribution
library(ggplot2)
set.seed(1)
allaverage = rep(NA, 1000)
for (i in 1:1000)
allaverage[i] = mean(rnorm(48, mean = 85, sd = 10))
ggplot(data.frame(allaverage), aes(x=allaverage)) +
geom_histogram(aes(y=..density..), colour="black", fill="white") +
geom_density(alpha=.2, fill="#FF6666") +
labs(x = "Simulated Average Scores") +
geom_vline(xintercept = 87, color = "blue", size = 1.2)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# Now, a surprising result should fall into the top 5% of that distribution
quantile(allaverage, 0.95)
## 95%
## 87.3478
mean(allaverage > 87)
## [1] 0.073
The distribution (of averages) we are trying to obtain from this simulation is called the “Null distribution”. If our assumption (the Null) is true, then about 7.3% of those mean scores would be greater than 87. At our 0.05 significance level, this is not extreme enough evidence. Hence, we will not reject this hypothesis.
In fact, we can skip the entire simulation process and know the Null distribution exactly, thanks to statistical theory. For all the testing procedures in this lecture, the Null distribution is known exactly, thanks to many early pioneers such as Sir Ronald Fisher, Karl Pearson and William Gosset.
In short, if each of the 48 students' scores independently follows a normal distribution with mean 85 and standard deviation (SD) 10, then the distribution of their average is a normal distribution with mean 85 and standard deviation \(10/\sqrt{48}\). We can compare this theoretical distribution with the simulated data obtained previously. They are in fact very similar.
# the simulated data
p <- ggplot(data.frame(allaverage), aes(x=allaverage)) +
geom_histogram(aes(y=..density..), colour="black", fill="white") +
geom_density(alpha=.2, fill="#FF6666") +
labs(x = "Simulated Average Scores") +
geom_vline(xintercept = 87, color = "blue", size = 1.2)
# add the normal distribution density
x = seq(80, 90, length.out = 100)
df = data.frame(x, "y" = dnorm(x, mean = 85, sd = 10/sqrt(48)))
p + geom_line(data = df, aes(x = x, y = y), color = "red", size = 2)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
We can also obtain the theoretical cut-off point from this distribution, given any \(\alpha\) value. The entire process can be carried out with the following code to obtain the \(p\)-value.
# calculate p-value of the z-test
pnorm(87 - 85, sd = 10/sqrt(48), lower.tail = FALSE)
## [1] 0.08292833
The most common misunderstanding of p-values is that the p-value is the probability that the Null hypothesis is true.
Instead, it indicates the probability that, if the Null hypothesis were true, we would observe a result as extreme as, or more extreme than, the one actually observed.
It should be noted that a hypothesis testing procedure and its conclusion depend heavily on the assumptions. When these assumptions are violated, we may end up with a higher (sometimes lower) chance of rejecting the Null even when the Null is true, which leads to a higher (or lower) chance of type I error.
Hence, we can expect four possible outcomes:
|                  | Accept Null                       | Reject Null               |
|------------------|-----------------------------------|---------------------------|
| Null True        | Confidence Level (1 - \(\alpha\)) | Type I Error (\(\alpha\)) |
| Alternative True | Type II Error                     | Power                     |
Based on our preliminary analysis, we would decide to reject the Null hypothesis if the observed mean is larger than 87.34. However, keep in mind that at this point we have not observed the actual data yet, and we do not know whether the data come from the Null or the Alternative. Previously we only discussed what happens if the Null is the data generator, but what if the Alternative is?
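To get a sense of what happens under the Alternative, here is a minimal simulation sketch. It assumes a hypothetical Alternative in which the true mean score is 88 (this value is not part of the example; it is chosen only for illustration), keeps SD = 10 and n = 48, and estimates how often the class average would exceed the cutoff 87.34, i.e., the power.
# a hypothetical Alternative: true mean score 88 (an illustrative choice), SD = 10, n = 48
set.seed(2)
alt_average = replicate(1000, mean(rnorm(48, mean = 88, sd = 10)))
# how often the class average exceeds the cutoff 87.34, i.e., the power
mean(alt_average > 87.34)
# the theoretical counterpart based on the normal distribution
pnorm(87.34, mean = 88, sd = 10/sqrt(48), lower.tail = FALSE)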
Formally, the above test is called the z-test. Besides the statistical mechanism that we have already discussed, one crucial aspect of the z-test is the assumption that the standard deviation (\(\sigma\)) of a score is already known. This is usually not the case in practice, hence the z-test is rarely used unless we have strong prior information. Nonetheless, if we want to obtain the theoretical p-value from such a test, we can use the normal distribution functions to help us.
Example 1: Suppose we are interested in testing whether the average score is larger than 80. Assume that the true ability of each student follows a normal distribution with SD = 15, and we actually observe an average test score of 90. Can you obtain the \(p\)-value of this test? The pnorm() function, with lower.tail = FALSE, calculates the probability that a normal random variable (with given mean and sd) is greater than a certain value.
# 90-80 is the value
# 15/sqrt(48) is sd
# mean is 0 by default
# lower.tail=FALSE means that we are looking at the upper side of the distribution
pnorm(90 - 80, sd = 15/sqrt(48), lower.tail = FALSE)
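Equivalently, we could standardize the difference first and compare it with a standard normal distribution. This is just a rearrangement of the same calculation, not a different test.
# the same p-value via the standardized z statistic
z = (90 - 80) / (15 / sqrt(48))
pnorm(z, lower.tail = FALSE)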
As we explained before, oftentimes we need to estimate the standard deviation \(\sigma\) or the variance \(\sigma^2\). This is done by calculating the (unbiased) sample variance
\[ \hat\sigma^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar x)^2\] The standard deviation estimate is simply the square root of \(\hat\sigma^2\).
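As a quick check of the formula, here is a minimal sketch that computes the sample variance by hand and compares it with R's built-in var() and sd(); the sample itself is simulated, only for illustration.
# a small simulated sample, only for illustrating the formula
set.seed(3)
x = rnorm(20, mean = 85, sd = 10)
n = length(x)
# unbiased sample variance computed from the formula
sigma2_hat = sum((x - mean(x))^2) / (n - 1)
c(sigma2_hat, var(x))       # the two should match
c(sqrt(sigma2_hat), sd(x))  # the estimated standard deviation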
We will introduce several \(t\)-tests that utilize this quantity and the \(t\)-distribution.
The essential idea of these tests is the same as that of the \(z\)-test: figure out the Null distribution, and look at where the observed value lies within it. If it lies at the extreme end, with less than \(\alpha\) chance, then we reject the Null.
A motivating example. The Cleveland clinic heart disease dataset aims at modeling the diagnosis of coronary heart disease. Let's look at a simple task to analyze the age variable: we want to know if the average age of patients who have been identified with coronary heart disease is larger than 50. We observe 303 individuals in this dataset.
# read in the Cleveland heart disease data
heart = read.csv("processed_cleveland.csv")
heart
hist(heart$age)
The difficulty lies in figuring out a way to incorporate the estimated SD instead of using the known truth. Researchers realized that the following quantity follows a \(t\) distribution with \(n-1\) degrees of freedom, if the Null hypothesis is true:
\[\frac{\bar{x} - \mu}{\hat\sigma / \sqrt{n}}.\] Here, \(\bar x\) is the sample mean we actually observed (such as 87 in the \(z\)-test example), \(\mu\) is the Null hypothesis value (such as 85), and \(\hat\sigma\) is the estimated standard deviation (which replaces the known \(\sigma\), such as 10). This is called the Student's \(t\)-test. There is already an existing function that can be used to perform this test. The following code tests the hypothesis
\[H_0: \text{mean age = 50} \quad \text{vs.} \quad H_1: \text{mean age > 50}\]
t.test(heart$age, mu = 50, alternative = "greater")
##
## One Sample t-test
##
## data: heart$age
## t = 8.5486, df = 302, p-value = 3.151e-16
## alternative hypothesis: true mean is greater than 50
## 95 percent confidence interval:
## 53.58221 Inf
## sample estimates:
## mean of x
## 54.43894
Let's look at some of the outputs of this function: the \(t\) statistic (8.5486), the degrees of freedom (\(n - 1 = 302\)), the \(p\)-value, the one-sided 95% confidence interval, and the sample mean (54.44).
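We can also reproduce the \(t\) statistic and the \(p\)-value by hand, which makes the connection to the formula above explicit. A minimal sketch, assuming the heart data has been loaded as before:
# compute the one-sample t statistic and its p-value by hand
n = length(heart$age)
t_stat = (mean(heart$age) - 50) / (sd(heart$age) / sqrt(n))
t_stat
pt(t_stat, df = n - 1, lower.tail = FALSE)  # should match the t.test() output above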
Remark on the \(t\) distribution: A \(t\) distribution with 1 degree of freedom has much heavier tails than the normal distribution, hence is more “spread out”. You can notice from the plot below that as the degrees of freedom increase, the distribution becomes closer and closer to a standard normal distribution.
x = seq(-5, 5, 0.01)
den = data.frame("x" = x, "y" = dnorm(x), "dist" = "norm")
den = rbind(den, data.frame("x" = x, "y" = dt(x, df = 1), "dist" = "t(1)"))
den = rbind(den, data.frame("x" = x, "y" = dt(x, df = 2), "dist" = "t(3)"))
den = rbind(den, data.frame("x" = x, "y" = dt(x, df = 6), "dist" = "t(6)"))
ggplot(den, aes(x=x, y=y, group=dist)) +
geom_line(aes(linetype=dist, color=dist), size = 1.5) +
scale_linetype_manual(values=c("solid", "solid", "twodash", "dotted")) +
scale_color_manual(values=c("darkorange", "deepskyblue", "deepskyblue", "deepskyblue")) +
theme(legend.key.size = unit(2, 'cm'))
Remark on the confidence interval: An interesting phenomenon about the sample mean \(\bar x\) is that it concentrates around the true population mean, and as the sample size increases, this concentration becomes tighter; moreover, its fluctuation around the population mean is approximately normal, which is the content of the Central Limit Theorem. This is the foundation of almost all statistical research. In our context, since the sample mean eventually varies around the true population mean, the true (unknown) population mean should also be somewhere near the sample mean, and the amount of variation is quantified by \(\hat\sigma / \sqrt{n}\). Based on this fact, we can construct an interval that is wide enough to incorporate 95% of the uncertainty and draw inference about \(\mu\). The details of this are beyond this course.
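For reference, the lower bound of the one-sided 95% confidence interval reported by t.test() above can be reproduced from exactly this quantification of variation: the sample mean minus a \(t\) quantile times \(\hat\sigma/\sqrt{n}\). A minimal sketch:
# lower bound of the one-sided 95% confidence interval for the mean age
n = length(heart$age)
mean(heart$age) - qt(0.95, df = n - 1) * sd(heart$age) / sqrt(n)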
Question: is the average age different across gender?
For this question, we can set up a new hypothesis testing problem:
\[H_0: \mu_1 = \mu_2 \quad \text{vs.} \quad H_1: \mu_1 \neq \mu_2,\] where \(\mu_1\) and \(\mu_2\) represent the two population means.
The idea is still the same: we construct some test statistic (which can be calculated from any observed samples), and this quantity would follow a known distribution if the Null hypothesis is true. Hence, after observing the actual samples, we evaluate whether the observed value lies at the extreme end of the known Null distribution. It turns out that we can use the following statistic, assuming that the variances of the two samples are the same:
\[t = \frac{\bar x_1 - \bar x_2}{\hat\sigma_p \sqrt{1/n_1 + 1/n_2}}\] Here, \(\bar x_1\) and \(\bar x_2\) are the sample averages, \(n_1\) and \(n_2\) are the corresponding sample sizes, and \(\hat\sigma_p\) is a pooled standard deviation. Note that this is NOT obtained by pooling all samples together and then calculating the standard deviation; it is a weighted combination of the two sample variances, \(\hat\sigma_p^2 = \frac{(n_1 - 1)\hat\sigma_1^2 + (n_2 - 1)\hat\sigma_2^2}{n_1 + n_2 - 2}\).
male_age = heart$age[heart$sex == 1]
female_age = heart$age[heart$sex == 0]
t.test(male_age, female_age, alternative = "two.sided", var.equal = TRUE)
##
## Two Sample t-test
##
## data: male_age and female_age
## t = -1.7004, df = 301, p-value = 0.09009
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -4.0701726 0.2967765
## sample estimates:
## mean of x mean of y
## 53.83495 55.72165
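To make the pooled standard deviation concrete, here is a minimal sketch that computes \(\hat\sigma_p\) and the two-sample \(t\) statistic by hand; it should match the t.test() output above.
# pooled standard deviation and the two-sample t statistic by hand
n1 = length(male_age)
n2 = length(female_age)
sp = sqrt(((n1 - 1) * var(male_age) + (n2 - 1) * var(female_age)) / (n1 + n2 - 2))
t_stat = (mean(male_age) - mean(female_age)) / (sp * sqrt(1/n1 + 1/n2))
t_stat
2 * pt(abs(t_stat), df = n1 + n2 - 2, lower.tail = FALSE)  # two-sided p-value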
Of course, you can imagine that if we do not assume equal variance, there will be another version of the test, the Welch two sample t-test. The two versions can give quite different results if the two populations display very different variances. Moreover, the degrees of freedom associated with this test may not be an integer.
t.test(male_age, female_age, alternative = "two.sided", var.equal = FALSE)
##
## Welch Two Sample t-test
##
## data: male_age and female_age
## t = -1.6648, df = 178.57, p-value = 0.0977
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -4.1230253 0.3496293
## sample estimates:
## mean of x mean of y
## 53.83495 55.72165
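The non-integer degrees of freedom (178.57 above) come from the Welch-Satterthwaite approximation. A minimal sketch of that calculation, using the same male_age and female_age samples:
# Welch-Satterthwaite approximate degrees of freedom
v1 = var(male_age) / length(male_age)
v2 = var(female_age) / length(female_age)
(v1 + v2)^2 / (v1^2 / (length(male_age) - 1) + v2^2 / (length(female_age) - 1))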
Remark on the paired \(t\)-test. This is a special case where each subject is observed twice, and we want to know if the first measurement differs from the second measurement, on average, within this population. This is very popular in clinical trials. A theoretical problem is that if we treat the two sets of measurements as independent samples and estimate the pooled variance, it can be very large and lead to insignificance. The strategy is to take the difference between the two measurements from the same subject, and perform a one-sample \(t\)-test on these differences with 0 as the Null hypothesis value. The following code demonstrates this effect.
# simulate paired data: the second measurement is the first plus a 0.2 shift and small noise
x1 = rnorm(100)
x2 = x1 + 0.2 + rnorm(100, sd = 0.1)
# without pairing
t.test(x1, x2, alternative = "two.sided")
##
## Welch Two Sample t-test
##
## data: x1 and x2
## t = -1.5518, df = 197.9, p-value = 0.1223
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.46410388 0.05534425
## sample estimates:
## mean of x mean of y
## -0.10764767 0.09673214
# with pairing
t.test(x1, x2, alternative = "two.sided", paired = TRUE)
##
## Paired t-test
##
## data: x1 and x2
## t = -19.126, df = 99, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.2255836 -0.1831760
## sample estimates:
## mean of the differences
## -0.2043798
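As described above, the paired test is identical to a one-sample \(t\)-test on the within-subject differences; the following gives exactly the same result as the paired version.
# equivalent one-sample t-test on the within-subject differences
t.test(x1 - x2, mu = 0, alternative = "two.sided")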
Motivating question: is fasting blood sugar (fbs) prevalence different across gender?
The challenge here is that both variables are categorical. We can first use a contingency table to summarize the results:
heart$sex = factor(heart$sex)
levels(heart$sex) = c("female", "male")
heart$fbs = factor(heart$fbs)
levels(heart$fbs) = c("low", "high")
table(heart$sex, heart$fbs)
##
## low high
## female 85 12
## male 173 33
One way to test whether fbs is the same across gender is the Chi-squared (\(\chi^2\)) test of independence, since if the two variables are independent, then the prevalence would be the same regardless of gender. The test is called the Chi-squared (\(\chi^2\)) test because the test statistic (skipped here) approximately follows a Chi-squared distribution when the sample size is large enough. The following code can be used:
chisq.test(heart$sex, heart$fbs)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: heart$sex and heart$fbs
## X-squared = 0.43559, df = 1, p-value = 0.5093
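Although the test statistic itself is skipped here, we can peek at the expected counts under independence that the test compares against. Note that chisq.test() applies Yates' continuity correction by default for a 2-by-2 table, so the uncorrected statistic shown below is slightly larger than the one reported above.
# expected counts under independence (row total * column total / grand total)
chisq.test(heart$sex, heart$fbs)$expected
# the uncorrected chi-squared statistic, for comparison
chisq.test(heart$sex, heart$fbs, correct = FALSE)$statistic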
An alternative way of doing this test is Fisher's exact test. This test was used in our Lady tasting tea problem, when Fisher originally proposed the first hypothesis testing procedure. However, the test can be computationally inefficient because it requires calculating very large combinatorial numbers when the sample size is large. Moreover, in many situations where the sample size is sufficient, it gives similar results to the Chi-squared (\(\chi^2\)) test.
fisher.test(heart$sex, heart$fbs)
##
## Fisher's Exact Test for Count Data
##
## data: heart$sex and heart$fbs
## p-value = 0.4897
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 0.6400317 3.0214601
## sample estimates:
## odds ratio
## 1.349867
In all of the above examples, we assumed that we observe the samples only once. Now let's imagine that there are 100 researchers, and each of them observes their own sample.
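To see why this matters, here is a minimal simulation sketch: 100 researchers each test a true Null (mean score 85, using the earlier class setup) at the 0.05 level. Even though nothing is going on, roughly 5 of them would "discover" a significant result by chance.
# 100 researchers, each testing a true Null (mean = 85) at the 0.05 level
set.seed(4)
pvals = replicate(100, t.test(rnorm(48, mean = 85, sd = 10),
                              mu = 85, alternative = "greater")$p.value)
sum(pvals < 0.05)  # the number of false discoveries; about 5 on average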
In genetic research, such as the genome-wide association studies, multiple testings and false discoveries are important issues.