This document focuses on introducing linear regression models. We will
start with the correlation coefficient, which is the most commonly
used measure for describing the relationship between two continuous
variables. We emphasize that correlation is different from independence,
although the two are often confused in practice. Secondly,
we will introduce simple linear regression. This is essentially the
same as fitting a correlation, but it also includes an intercept term.
Using the lm() function is the focus here, and we will
discuss several important outputs of a fitted model. Thirdly, we extend
this to multiple linear regression. From a coding perspective, there is
no difference. However, the mathematics behind it can be complicated.
The important concepts here are confounding variables and collinearity,
which can often cause trouble in practice.
Correlation is one of the most commonly used concepts in statistics
when analyzing the relationship between continuous variables. The most
popular mathematical definition, the Pearson correlation, is named after Karl Pearson.
However, the concept was proposed by Francis Galton
in 1888. Galton studied the association between the average height of
parents and the height of their adult children. The data can be loaded from
the HistData package.
library(HistData)
data(GaltonFamilies)
GaltonFamilies
The sample Pearson correlation between two variables, \(X\) and \(Y\), is defined as
\[ r_{xy} = \frac{\sum_i (x_i - \bar x)(y_i - \bar y)}{(n-1)\hat\sigma_x \hat\sigma_y }\] where \(\hat\sigma_x\) and \(\hat\sigma_y\) are the unbiased sample standard deviations. The numerator is \((n-1)\) times the sample covariance between \(X\) and \(Y\), so \(r_{xy}\) is simply the sample covariance rescaled by the two standard deviations; it always lies between \(-1\) and \(1\).
To calculate the Pearson correlation, we can use the
cor()
function
cor(GaltonFamilies$midparentHeight, GaltonFamilies$childHeight)
## [1] 0.3209499
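As a quick sanity check of the formula above, we can compute the correlation by hand and compare it with the cor() output; this is only a sketch using the same Galton data:
# compute the sample Pearson correlation directly from the formula
x = GaltonFamilies$midparentHeight
y = GaltonFamilies$childHeight
sum((x - mean(x)) * (y - mean(y))) / ((length(x) - 1) * sd(x) * sd(y))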
plot(GaltonFamilies$midparentHeight, GaltonFamilies$childHeight,
xlab = "Average Parent Height", ylab = "Child Height",
pch = 19, cex = 0.5, main = "Galton Data", col = "deepskyblue")
legend("bottomright", "Correlation = 0.321", cex = 1.5)
As we can see, there is a positive trend in how the two variables move together. However, this correlation is not very strong. In practice, it is difficult to decide whether a correlation is strong or not. Cohen (1988) suggests that 0.1 is weak, 0.3 is moderate, and 0.5 is strong. However, this is a very field-specific judgment. For example, Gignac and Szodorai (2016) reported that less than \(3\%\) of correlations reported in the psychology literature were found to be as large as 0.5. In biomedical studies, it is also rare to see a very strong correlation between a single biomarker and a certain disease. Hence, you need to make the judgment accordingly. The following plots of different correlation magnitudes may give you an idea:
library(MASS)
set.seed(2)
par(mfrow=c(2, 3)) # this sets multiple plots
par(mar=c(1, 1, 2, 1)) # this sets margins for each of them
corxy = c(-0.5, 0, 0.25, 0.5, 0.7, 0.9)
for (i in 1:6)
{
# this code generates variables from a multivariate normal distribution with correlated variables
xy = mvrnorm(500, mu = c(0, 0),
Sigma = matrix(c(1, corxy[i], corxy[i], 1), 2, 2))
plot(xy[, 1], xy[, 2], xaxt='n', yaxt='n',
pch = 19, cex = 0.5, col = "darkorange",
main = paste("Correlation approx.", corxy[i]))
}
A statistical test can be used to determine if the population correlation coefficient is different from zero or not. Note that what we introduced before is the sample correlation coefficient, which can be calculated from a collected sample. However, this sample is only a representation of the entire population, which has a true correlation \(\rho_{xy}\). Hence, we may be able to make inference about \(\rho_{xy}\) from the collected samples. Formally, the hypothesis is
\[H_0: \rho_{xy} = 0 \quad \text{vs.} \quad H_1: \rho_{xy} \neq 0.\] In fact, the test statistic is very simple,
\[ t = \frac{r_{xy} \sqrt{n-2}}{\sqrt{1 - r_{xy}^2}},\]
which follows a \(t\) distribution
with \(n-2\) degrees of freedom. This
can be done completely with R
.
cor.test(GaltonFamilies$midparentHeight, GaltonFamilies$childHeight)
##
## Pearson's product-moment correlation
##
## data: GaltonFamilies$midparentHeight and GaltonFamilies$childHeight
## t = 10.345, df = 932, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.2622011 0.3773285
## sample estimates:
## cor
## 0.3209499
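To see where these numbers come from, we can plug the sample correlation into the test statistic formula ourselves; a minimal sketch using the same data:
# compute the t statistic and two-sided p-value by hand
r = cor(GaltonFamilies$midparentHeight, GaltonFamilies$childHeight)
n = nrow(GaltonFamilies)
tstat = r * sqrt(n - 2) / sqrt(1 - r^2)
tstat
2 * pt(-abs(tstat), df = n - 2)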
The result turns out to be highly significant. Take a moment to think about the interpretation of a significant p-value. However, keep in mind that this is only testing whether the true correlation is zero or not. It does not reflect whether the correlation is strong or not. A large sample size will often lead to higher significance.
Being uncorrelated should not be confused with independence. The following example generates (theoretically) uncorrelated variables:
set.seed(1)
x = rnorm(1000)
y = x + rnorm(1000, sd = 0.5)
z = y^2
par(mfrow=c(1, 2))
par(mar=c(4, 4, 1, 1))
plot(x, y, pch = 19, cex = 0.2)
plot(x, z, pch = 19, cex = 0.2)
cor(x, y)
## [1] 0.8941315
cor(x, z)
## [1] -0.007100812
It is easy to see that although \(X\) and \(Y\) have a very high correlation (0.894), the sample correlation between \(X\) and \(Z\) is almost zero (-0.007). However, we cannot say that \(X\) and \(Z\) are independent. There is a precise definition of independence in statistics, which states that the joint distribution is the product of the two marginal distributions. That concept can be difficult to explain without proper mathematical background, but you can use this conceptual understanding of independence: if two variables are independent, then knowing the value of one tells you nothing about the distribution of the other.
This is clearly not the case between \(x\) and \(y\) (or \(x\) and \(z\)) in our previous plot, since as \(x\) changes, the mean of \(y\) (or \(z\)) shifts.
On the other hand, if two variables are independent, then their correlation has to be 0. But keep in mind that this is a statement about the entire population, and the sample (possibly small) you collected will always have variation.
The Pearson correlation has some assumptions associated with it. One
of the most crucial ones is that both variables should be normally
distributed. You may still use the Pearson correlation if there is only a
mild violation. However, in other cases you should consider rank-based correlations,
such as the Spearman correlation and Kendall's
\(\tau\). Another,
relatively new, choice is the distance
correlation, which is able to handle more complex dependencies. It
can be calculated using the energy
package.
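For illustration, here is a sketch of how these alternatives could be computed on the x and z variables generated above; the energy package is an add-on and is assumed to be installed.
# rank-based correlations measure monotone association,
# so they also stay close to zero for this U-shaped relationship
cor(x, z, method = "spearman")
cor(x, z, method = "kendall")
# the distance correlation from the energy package can detect this dependence
library(energy)
dcor(x, z)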
While the correlation between two variables can describe their association, it does not provide a tool to predict one variable from another. Linear regression can be used for that purpose. Here is an example using skin cancer data. This data was collected in 1950, when there were 48 states (without Alaska and Hawaii); Washington, D.C. is included as a 49th observation. We plot latitude against the mortality rate (number of deaths per 10 million people) for these 49 observations.
skincancer = read.table("skin.txt", header = TRUE)
par(mar = c(4,4,0.5,0.5))
plot(Mort ~ Lat, data = skincancer, pch = 19, col = "deepskyblue",
xlab = "Latitude", ylab = "Mortality Rate")
We can see that there is an association between the two, which can be approximated by a linear relationship. If we need to place a line on this plot to describe this linear relationship, what slope and intercept should that line have? How do we define the optimal line using the samples collected?
par(mar = c(4,4,0.5,0.5))
plot(Mort ~ Lat, data = skincancer, pch = 19, col = "deepskyblue",
xlab = "Latitude", ylab = "Mortality Rate")
# some random lines?
abline(350, -5, col = "darkorange", lwd = 2)
abline(420, -7, col = "darkorange", lwd = 2)
Legendre proposed a method called the least squares approach for solving linear regression systems. A simple linear regression concerns modeling the relationship
\[Y = \beta_0 + \beta_1 X + \epsilon.\] Here \(X\) is called a covariate (or independent variable) and \(Y\) is called an outcome variable (or response, dependent variable). The term \(\epsilon\) is used here to absorb any unexplained effect; it is also called the random noise, or random error.
Once we collect a set of observations \(\{x_i, y_i\}\), \(i = 1, \ldots, n\), the least squares approach means that we want to find a line that minimizes the following objective function:
\[\{\hat \beta_0, \hat \beta_1\} = \underset{\beta_0, \beta_1}{\arg\min} \frac{1}{n} \sum_{i=1}^n \big( y_i - \beta_0 - \beta_1 x_i \big)^2\] It is OK to not be familiar with this notation (which is about numerical optimization), but the term \(y_i - \beta_0 - \beta_1 x_i\) measures how much \(y_i\) deviates from the line you defined, \(\beta_0 + \beta_1 x_i\), for subject \(i\), and we sum the squares of these residuals over all subjects. You can visualize these residual terms in the following plot:
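A sketch that produces such a plot, drawing a vertical segment from each observation to a candidate line (here the least squares line fitted later is used for illustration):
# visualize the residuals: vertical segments from each point to the line
par(mar = c(4, 4, 0.5, 0.5))
plot(Mort ~ Lat, data = skincancer, pch = 19, col = "deepskyblue",
     xlab = "Latitude", ylab = "Mortality Rate")
tmpfit = lm(Mort ~ Lat, data = skincancer)
abline(tmpfit, col = "darkorange", lwd = 2)
segments(skincancer$Lat, skincancer$Mort, skincancer$Lat,
         fitted(tmpfit), col = "gray60")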
By altering the intercept and slope of this line, we want to achieve the smallest possible sum of squared errors. Interestingly, there is an analytic solution to this problem. The optimal parameter estimates can be calculated using the formulas:
\[\widehat \beta_1 = \frac{\sum_{i=1}^n
(x_i - \bar x)(y_i - \bar y)}{\sum_{i=1}^n (x_i - \bar x)^2}\]
\[\widehat \beta_0 = \bar y - \widehat
\beta_1 \bar x \] Of course, this can be performed in
R
.
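As a quick check, these formulas can be computed by hand on the skincancer data loaded above; the values should match the lm() output shown below. A minimal sketch:
# compute the least squares estimates directly from the formulas
x = skincancer$Lat
y = skincancer$Mort
b1 = sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b0 = mean(y) - b1 * mean(x)
c(b0, b1)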
The lm()
function is already included in the base
R
. Hence, you can directly fit the model.
# fitting a linear regression
fit = lm(Mort ~ Lat, data = skincancer)
Please be aware of the syntax here:
- The data = argument points to the data set, which contains properly named columns. In this case, the columns are Mort and Lat. To check the column names of a data set, you may use the colnames() function.
- The ~ sign separates the covariate on the right-hand side from the outcome variable on the left-hand side.
Now we can check the model fitting:
summary(fit)
##
## Call:
## lm(formula = Mort ~ Lat, data = skincancer)
##
## Residuals:
## Min 1Q Median 3Q Max
## -38.972 -13.185 0.972 12.006 43.938
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 389.1894 23.8123 16.34 < 2e-16 ***
## Lat -5.9776 0.5984 -9.99 3.31e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 19.12 on 47 degrees of freedom
## Multiple R-squared: 0.6798, Adjusted R-squared: 0.673
## F-statistic: 99.8 on 1 and 47 DF, p-value: 3.309e-13
The summary provides much useful information:
- Call restates the model specification.
- Residuals is a summary of all the \(e_i = y_i - \hat\beta_0 - \hat\beta_1 x_i\) terms. We usually do not use them except for model diagnostics.
- Coefficients contains all estimated parameters and their corresponding p-values (in the last column, Pr(>|t|)) for testing whether each one is zero (the null hypothesis) or not (the alternative). For example, Lat (latitude) is a highly significant variable. Each unit increase of Lat is associated with a change in Mort of about -5.98, i.e., the mortality rate decreases by roughly 6 deaths per 10 million people.
- Residual standard error is another summary of the residuals. Here the degrees of freedom refers to the estimation of the variance of the residuals. We have seen this in the \(t\)-test previously.
- Multiple R-squared measures the proportion of the variance in the response variable \(Y\) that can be explained by the predictor (\(X\)) in this regression model. It is also the squared version of a quantity called Multiple R, which is the Pearson correlation between the observed \(Y\) values and the fitted values \(\hat\beta_0 + \hat\beta_1 x_i\). This quantity ranges between 0 and 1, with smaller values indicating a very poor fit and 1 indicating a perfect fit.
- Adjusted R-squared is a modified version of Multiple R-squared, since the latter will only increase as the number of variables in the model increases, regardless of whether they are useful or not. The Adjusted R-squared is defined as \(1 - \frac{(1-R^2)(n-1)}{n-p-1}\), which takes the number of predictors (\(p\)) into account. We will see more of this in the multiple linear regression part.
- F-statistic and the associated p-value refer to an overall test of this model. This means that we are jointly testing whether all of the coefficients are zero. Note that this quantity is usually highly significant because it combines the effects of all variables. This is related to ANOVA (analysis of variance), which can be used to jointly test whether several variables are significant overall. However, that is beyond the scope of this class.
To obtain more detailed results, you can further extract quantities from the fitted R object (fit). This is done by adding a $ after the object name. The following code shows some of them.
# the fitted values
fit$fitted.values[1:5]
## 1 2 3 4 5
## 191.9274 182.9609 179.9721 165.0280 156.0616
# the regression coefficients
fit$coefficients
## (Intercept) Lat
## 389.189351 -5.977636
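A few other useful quantities can be obtained in a similar way or with helper functions; for example (a sketch):
# residuals of the first five observations
fit$residuals[1:5]
# confidence intervals of the regression coefficients
confint(fit)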
With the estimated coefficients, it is very easy to predict the outcome for a new
subject. For example, if we are interested in a location with
Lat \(= 40\), then we would
simply compute
\[389.189351 - 5.977636 \times 40 = 150.0839,\] which predicts that the mortality rate is 150.0839 per
10 million people. We can also perform this prediction using pre-defined
R functions. However, this requires setting up a new data
frame with the same structure as the training data:
testdata = data.frame("Lat" = 40)
testdata
predict(fit, newdata = testdata)
## 1
## 150.0839
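The predict() function can also return interval estimates through its interval argument; for example:
# confidence interval for the mean response, and prediction interval for a new observation
predict(fit, newdata = testdata, interval = "confidence")
predict(fit, newdata = testdata, interval = "prediction")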
Multiple linear regression is essentially adding more predictors. A mathematical representation is
\[Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p + \epsilon.\]
And the idea of the optimization is similar: we want to minimize the squared errors given a set of observed data,
\[\underset{\beta_0, \beta_1, \ldots, \beta_p}{\arg\min} \frac{1}{n} \sum_{i=1}^n \big( y_i - \beta_0 - \beta_1 x_{i1} - \beta_2 x_{i2} - \cdots - \beta_p x_{ip}\big)^2\]
It is not straightforward to solve for these parameters by hand when given a set of data. The solution involves some linear algebra:
\[\widehat{\boldsymbol \beta}= (\mathbf{X}^\text{T} \mathbf{X})^{-1} \mathbf{X}^\text{T}\mathbf{y}\]
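As a sketch, this formula can be verified numerically on the skin cancer data (using latitude and longitude as covariates); the result should match the coefficients from the lm() fit shown below.
# solve for the coefficients using the closed-form matrix solution
X = model.matrix(~ Lat + Long, data = skincancer)
y = skincancer$Mort
solve(t(X) %*% X, t(X) %*% y)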
This can be done using the lm()
function. For example,
if we want to include the longitude in the skin cancer data, we can
specify two variables using the +
sign in the right hand
side of ~
:
skincancer = read.table("skin.txt", header = TRUE)
# fitting a multiple linear regression
fit = lm(Mort~ Lat + Long, data = skincancer)
summary(fit)
##
## Call:
## lm(formula = Mort ~ Lat + Long, data = skincancer)
##
## Residuals:
## Min 1Q Median 3Q Max
## -36.551 -13.525 -0.757 14.055 41.700
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 400.6755 28.0512 14.284 < 2e-16 ***
## Lat -5.9308 0.6038 -9.822 7.18e-13 ***
## Long -0.1467 0.1873 -0.783 0.438
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 19.19 on 46 degrees of freedom
## Multiple R-squared: 0.684, Adjusted R-squared: 0.6703
## F-statistic: 49.79 on 2 and 46 DF, p-value: 3.101e-12
Most of the quantities in this output have similar meanings as in a simple linear regression, except that we now have an additional variable, longitude. However, longitude is not a significant variable, and in most cases you could remove it from the model.
Predictions can also be made using the predict()
function by specifying one or more data points. You can also do this by
hand by simply plugging the estimated coefficients into the linear
equation.
testdata = data.frame("Lat" = c(40, 45), "Long" = c(80, 100))
testdata
# predict this pre-specified testing data
predict(fit, testdata)
## 1 2
## 151.7097 119.1224
# predict the state of Illinois, which is the 12th row
predict(fit, skincancer[12, ])
## 12
## 150.3165
The Ocean
variable is a binary variable that indicates whether a state
is next to an ocean. We often call this type of variable (with two
categories) a dummy variable, and treat it as a factor in
R. The following code specifies a model with a categorical
variable with two levels.
fit = lm(Mort~ Lat + Long + as.factor(Ocean), data = skincancer)
summary(fit)
##
## Call:
## lm(formula = Mort ~ Lat + Long + as.factor(Ocean), data = skincancer)
##
## Residuals:
## Min 1Q Median 3Q Max
## -30.171 -10.070 -2.607 8.781 41.805
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 349.2369 27.0596 12.906 < 2e-16 ***
## Lat -5.4950 0.5289 -10.390 1.55e-13 ***
## Long 0.1219 0.1732 0.704 0.485245
## as.factor(Ocean)1 21.7976 5.2263 4.171 0.000137 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 16.48 on 45 degrees of freedom
## Multiple R-squared: 0.7721, Adjusted R-squared: 0.7569
## F-statistic: 50.83 on 3 and 45 DF, p-value: 1.7e-14
You may notice that in the output there is a new parameter estimate,
as.factor(Ocean)1. This parameter estimates
the effect of Ocean \(= 1\). You may wonder where the effect
of category 0 is. Since this variable has only two categories, we only need
one parameter to describe its effect: when Ocean
\(= 0\), we simply remove the
contribution of this parameter. You can validate this using the
following testing data:
testdata = data.frame("Lat" = c(40, 40),
"Long" = c(100, 100),
"Ocean" = c(0, 1))
testdata
# predict and compare the two results
predict(fit, testdata)
## 1 2
## 141.6275 163.4251
Sometimes a linear relationship is not sufficient to model the relationship, and we may want to include higher-order terms and also interactions to explain the effect. For example, we could add the squared term of latitude and the interaction between latitude and longitude to the model.
fit = lm(Mort~ Lat + Long + as.factor(Ocean) + I(Lat^2) + I(Lat*Long), data = skincancer)
summary(fit)
##
## Call:
## lm(formula = Mort ~ Lat + Long + as.factor(Ocean) + I(Lat^2) +
## I(Lat * Long), data = skincancer)
##
## Residuals:
## Min 1Q Median 3Q Max
## -28.055 -11.770 -2.746 10.003 38.654
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 167.29775 192.50086 0.869 0.390
## Lat 5.55593 8.01734 0.693 0.492
## Long -0.57828 1.87076 -0.309 0.759
## as.factor(Ocean)1 24.39569 5.55126 4.395 7.14e-05 ***
## I(Lat^2) -0.16444 0.11192 -1.469 0.149
## I(Lat * Long) 0.01888 0.04533 0.416 0.679
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 16.44 on 43 degrees of freedom
## Multiple R-squared: 0.7832, Adjusted R-squared: 0.758
## F-statistic: 31.07 on 5 and 43 DF, p-value: 3.059e-13
The prediction can be done using the same dataset we specified before:
predict(fit, testdata)
## 1 2
## 144.1132 168.5089
Do you notice the change of significance? What do you think is the cause? Which variable(s) would you like to remove/keep?
Collinearity refers to the situation where the covariates are (almost) linearly dependent. This is a very severe problem for linear regression. Let’s look at an example where we have \(Y = X_1 + X_2 + \epsilon\), and the variables \(X_1\) and \(X_2\) are highly correlated with each other. The true parameters are 1 and 1, respectively. Since we know the true relationship, both variables should be highly significant. However, this is not the case when we fit the regression model. Moreover, the parameter estimates become 6.36779 and -4.34668, which is awfully wrong. However, their sum is almost 2, and this model explains more than 75% of the variation. Can you think of why?
set.seed(1)
x1 = rnorm(100)
x2 = x1 + rnorm(100, sd = 0.01)
y = x1 + x2 + rnorm(100)
fit = lm(y~ x1 + x2)
summary(fit)
##
## Call:
## lm(formula = y ~ x1 + x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.94359 -0.43645 0.00202 0.63692 2.63941
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.02535 0.10519 0.241 0.810
## x1 6.36779 10.94836 0.582 0.562
## x2 -4.34668 10.94785 -0.397 0.692
##
## Residual standard error: 1.043 on 97 degrees of freedom
## Multiple R-squared: 0.7556, Adjusted R-squared: 0.7505
## F-statistic: 149.9 on 2 and 97 DF, p-value: < 2.2e-16
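A simple diagnostic is to check the correlation among the covariates themselves; a more formal tool is the variance inflation factor, e.g., car::vif(fit), assuming the car package is installed. A minimal sketch:
# the two covariates are almost perfectly correlated
cor(x1, x2)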
On the other hand, if we create independent covariates with exactly the same model, we see the expected results.
set.seed(1)
x1 = rnorm(100)
x2 = rnorm(100)
y = x1 + x2 + rnorm(100)
fit = lm(y~ x1 + x2)
summary(fit)
##
## Call:
## lm(formula = y ~ x1 + x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.94359 -0.43645 0.00202 0.63692 2.63941
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.02535 0.10519 0.241 0.81
## x1 1.02111 0.11675 8.746 6.81e-14 ***
## x2 0.94653 0.10948 8.646 1.12e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.043 on 97 degrees of freedom
## Multiple R-squared: 0.609, Adjusted R-squared: 0.601
## F-statistic: 75.55 on 2 and 97 DF, p-value: < 2.2e-16
In practice, you may want to eliminate variables with “almost the same definition”, which can be highly correlated. However, this is inevitable if we have a large number of variables, i.e., when \(p\) is large. Eventually, when \(p\) is close to or larger than the sample size \(n\), an ordinary linear regression model should not be used. There are several ways to handle this situation; for example, a simple strategy called step-wise regression could be used, which adds or removes variables from the model one at a time based on their p-values. When \(p\) is too large, you may consider Ridge or Lasso regression.
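For illustration, base R provides the step() function, which automates this type of selection (using AIC rather than p-values); a minimal sketch on the skin cancer model:
# AIC-based stepwise selection starting from the full model
full = lm(Mort ~ Lat + Long + as.factor(Ocean), data = skincancer)
step(full, direction = "both")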
Let’s imagine a relationship in which \(X\) (e.g., a genetic factor) causes both \(Z\) (a symptom) and \(Y\) (another symptom). This can be demonstrated using the following code:
set.seed(1)
X = rnorm(100)
Z = X + rnorm(100, sd = 0.5)
Y = X + rnorm(100, sd = 0.5)
If we simply regress \(Y\) on \(Z\), we may conclude that \(Z\) would increase \(Y\), as \(Z\) is highly significant. However, this is not really the true relationship.
summary(lm(Y~Z))$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.05396496 0.06910713 0.7808884 4.367504e-01
## Z 0.77524559 0.06799321 11.4018082 1.135413e-19
Instead, we should also control for the effect of \(X\); once we do, \(Z\) is no longer significant. This means that, given the information of \(X\), \(Z\) does not affect \(Y\).
summary(lm(Y~Z+X))$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.01267671 0.05259704 0.2410157 8.100516e-01
## Z -0.05346682 0.10947855 -0.4883771 6.263850e-01
## X 1.06402190 0.12401872 8.5795268 1.551643e-13
In this example, \(X\) is a confounding variable. Not including appropriate confounding variables can often lead to wrong conclusions. Commonly used confounding variables are age, gender, and race. However, you should always decide which ones to include based on the specific study. Here are some things to consider: