This document focuses on introducing linear regression models. We will
start with the correlation coefficient, which is the most commonly
used measure for describing the relationship between two continuous
variables. We emphasize that correlation is different from independence,
although the two are often confused in practice. Secondly,
we will introduce simple linear regression. This is essentially the
same as fitting a correlation, but it also includes an intercept term.
Using the lm() function is the focus here, and we will
discuss several important outputs of a fitted model. Thirdly, we extend
this to multiple linear regression. From a coding perspective, there is
no difference. However, the mathematics behind it can be complicated.
The important concepts here are confounding variables and collinearity,
which can often cause trouble in practice.
Correlation is one of the most commonly used concepts in statistics
when analyzing the relationship between continuous variables. The most
popular mathematical definition, the Pearson correlation, is named after Karl Pearson.
However, the concept was proposed by Francis Galton
in 1888. Galton studied the association between the average height of
parents and the height of their adult children. The data can be loaded from
the HistData package.
library(HistData)
data(GaltonFamilies)
GaltonFamilies
The sample Pearson correlation between two variables, \(X\) and \(Y\), is defined as
\[ r_{xy} = \frac{\sum_i (x_i - \bar x)(y_i - \bar y)}{(n-1)\hat\sigma_x \hat\sigma_y }\] where \(\hat\sigma_x\) and \(\hat\sigma_y\) are the unbiased sample standard deviations. The numerator is \((n-1)\) times the sample covariance between \(X\) and \(Y\), so \(r_{xy}\) is simply the sample covariance rescaled by the two standard deviations; it always lies between \(-1\) and \(1\).
To calculate the Pearson correlation, we can use the
cor()
function
cor(GaltonFamilies$midparentHeight, GaltonFamilies$childHeight)
## [1] 0.3209499
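As a quick sanity check of the formula above, we can compute the correlation by hand and compare it with the cor() output; this is only a sketch using the same Galton data:
# compute the sample Pearson correlation directly from the formula
x = GaltonFamilies$midparentHeight
y = GaltonFamilies$childHeight
sum((x - mean(x)) * (y - mean(y))) / ((length(x) - 1) * sd(x) * sd(y))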
plot(GaltonFamilies$midparentHeight, GaltonFamilies$childHeight,
xlab = "Average Parent Height", ylab = "Child Height",
pch = 19, cex = 0.5, main = "Galton Data", col = "deepskyblue")
legend("bottomright", "Correlation = 0.321", cex = 1.5)
As we can see, there is a positive trend in how the two variables move together. However, this correlation is not very strong. In practice, it is difficult to decide whether a correlation is strong or not. Cohen (1988) suggests that 0.1 is weak, 0.3 is moderate, and 0.5 is strong. However, this is a very field-specific judgment. For example, Gignac and Szodorai (2016) reported that less than \(3\%\) of correlations reported in the psychology literature were found to be as large as 0.5. In biomedical studies, it is also rare to see a very strong correlation between a single biomarker and a certain disease. Hence, you need to make the judgment accordingly. The following plots of different correlation magnitudes may give you an idea:
library(MASS)
set.seed(2)
par(mfrow=c(2, 3)) # this sets multiple plots
par(mar=c(1, 1, 2, 1)) # this sets margins for each of them
corxy = c(-0.5, 0, 0.25, 0.5, 0.7, 0.9)
for (i in 1:6)
{
# this code generates variables from a multivariate normal distribution with correlated variables
xy = mvrnorm(500, mu = c(0, 0),
Sigma = matrix(c(1, corxy[i], corxy[i], 1), 2, 2))
plot(xy[, 1], xy[, 2], xaxt='n', yaxt='n',
pch = 19, cex = 0.5, col = "darkorange",
main = paste("Correlation approx.", corxy[i]))
}
A statistical test can be used to determine if the population correlation coefficient is different from zero or not. Note that what we introduced before is the sample correlation coefficient, which can be calculated from a collected sample. However, this sample is only a representation of the entire population, which has a true correlation \(\rho_{xy}\). Hence, we may be able to make inference about \(\rho_{xy}\) from the collected samples. Formally, the hypothesis is
\[H_0: \rho_{xy} = 0 \quad \text{vs.} \quad H_1: \rho_{xy} \neq 0.\] In fact, the test statistic is very simple,
\[ t = \frac{r_{xy} \sqrt{n-2}}{\sqrt{1 - r_{xy}^2}},\]
which follows a \(t\) distribution
with \(n-2\) degrees of freedom. This
can be done completely with R
.
cor.test(GaltonFamilies$midparentHeight, GaltonFamilies$childHeight)
##
## Pearson's product-moment correlation
##
## data: GaltonFamilies$midparentHeight and GaltonFamilies$childHeight
## t = 10.345, df = 932, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.2622011 0.3773285
## sample estimates:
## cor
## 0.3209499
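To see where these numbers come from, we can plug the sample correlation into the test statistic formula ourselves; a minimal sketch using the same data:
# compute the t statistic and two-sided p-value by hand
r = cor(GaltonFamilies$midparentHeight, GaltonFamilies$childHeight)
n = nrow(GaltonFamilies)
tstat = r * sqrt(n - 2) / sqrt(1 - r^2)
tstat
2 * pt(-abs(tstat), df = n - 2)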
The result turns out to be highly significant. Take a moment to think about the interpretation of a significant p-value. However, keep in mind that this is only testing whether the true correlation is zero or not. It does not reflect whether the correlation is strong or not. A large sample size will often lead to higher significance.
Being uncorrelated should not be confused with independence. The following example generates (theoretically) uncorrelated variables:
set.seed(1)
x = rnorm(1000)
y = x + rnorm(1000, sd = 0.5)
z = y^2
par(mfrow=c(1, 2))
par(mar=c(4, 4, 1, 1))
plot(x, y, pch = 19, cex = 0.2)
plot(x, z, pch = 19, cex = 0.2)
cor(x, y)
## [1] 0.8941315
cor(x, z)
## [1] -0.007100812
It is easy to see that although \(X\) and \(Y\) have a very high correlation (0.894), the sample correlation between \(X\) and \(Z\) is almost zero (-0.007). However, we cannot say that \(X\) and \(Z\) are independent. There is a precise definition of independence in statistics, which states that the joint distribution is the product of the two marginal distributions. That concept can be difficult to explain without proper mathematical background, but you can use this conceptual understanding of independence: if two variables are independent, then knowing the value of one tells you nothing about the distribution of the other.
This is clearly not the case between \(x\) and \(y\) (or \(x\) and \(z\)) in our previous plot, since as \(x\) changes, the mean of \(y\) (or \(z\)) shifts.
On the other hand, if two variables are independent, then their correlation has to be 0. But keep in mind that this is a statement about the entire population, and the sample (possibly small) you collected will always have variation.
The Pearson correlation has some assumptions associated with it. One
of the most crucial ones is that both variables should be normally
distributed. You may still use the Pearson correlation if there is only a
mild violation. However, in other cases you should consider rank-based correlations,
such as the Spearman correlation and Kendall's
\(\tau\). Another,
relatively new, choice is the distance
correlation, which is able to handle more complex dependencies. It
can be calculated using the energy
package.
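For illustration, here is a sketch of how these alternatives could be computed on the x and z variables generated above; the energy package is an add-on and is assumed to be installed.
# rank-based correlations measure monotone association,
# so they also stay close to zero for this U-shaped relationship
cor(x, z, method = "spearman")
cor(x, z, method = "kendall")
# the distance correlation from the energy package can detect this dependence
library(energy)
dcor(x, z)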
While the correlation between two variables can describe their association, it does not provide a tool to predict one variable from another. Linear regression can be used for that purpose. Here is an example using skin cancer data. This data was collected in 1950, when there were 48 states (without Alaska and Hawaii); Washington, D.C. is included as a 49th observation. We plot latitude against the mortality rate (number of deaths per 10 million people) for these 49 observations.
skincancer = read.table("skin.txt", header = TRUE)
par(mar = c(4,4,0.5,0.5))
plot(Mort ~ Lat, data = skincancer, pch = 19, col = "deepskyblue",
xlab = "Latitude", ylab = "Mortality Rate")
We can see that there is an association between the two, which can be approximated by a linear relationship. If we need to place a line on this plot to describe this linear relationship, what slope and intercept should that line have? How do we define the optimal line using the samples collected?
par(mar = c(4,4,0.5,0.5))
plot(Mort ~ Lat, data = skincancer, pch = 19, col = "deepskyblue",
xlab = "Latitude", ylab = "Mortality Rate")
# some random lines?
abline(350, -5, col = "darkorange", lwd = 2)
abline(420, -7, col = "darkorange", lwd = 2)
Legendre proposed a method called the least squares approach for solving linear regression systems. A simple linear regression concerns modeling the relationship
\[Y = \beta_0 + \beta_1 X + \epsilon.\] Here \(X\) is called a covariate (or independent variable) and \(Y\) is called an outcome variable (or response, dependent variable). The term \(\epsilon\) is used here to absorb any unexplained effect; it is also called the random noise, or random error.
Once we collect a set of observations \(\{x_i, y_i\}\), \(i = 1, \ldots, n\), the least squares approach means that we want to find a line that minimizes the following objective function:
\[\{\hat \beta_0, \hat \beta_1\} = \underset{\beta_0, \beta_1}{\arg\min} \frac{1}{n} \sum_{i=1}^n \big( y_i - \beta_0 - \beta_1 x_i \big)^2\] It is OK to not be familiar with this notation (which is about numerical optimization), but the term \(y_i - \beta_0 - \beta_1 x_i\) measures how much \(y_i\) deviates from the line you defined, \(\beta_0 + \beta_1 x_i\), for subject \(i\), and we sum the squares of these residuals over all subjects. You can visualize these residual terms in the following plot:
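A sketch that produces such a plot, drawing a vertical segment from each observation to a candidate line (here the least squares line fitted later is used for illustration):
# visualize the residuals: vertical segments from each point to the line
par(mar = c(4, 4, 0.5, 0.5))
plot(Mort ~ Lat, data = skincancer, pch = 19, col = "deepskyblue",
     xlab = "Latitude", ylab = "Mortality Rate")
tmpfit = lm(Mort ~ Lat, data = skincancer)
abline(tmpfit, col = "darkorange", lwd = 2)
segments(skincancer$Lat, skincancer$Mort, skincancer$Lat,
         fitted(tmpfit), col = "gray60")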
By altering the intercept and slope of this line, we want to achieve the smallest possible sum of squared errors. Interestingly, there is an analytic solution to this problem. The optimal parameter estimates can be calculated using the formulas:
\[\widehat \beta_1 = \frac{\sum_{i=1}^n
(x_i - \bar x)(y_i - \bar y)}{\sum_{i=1}^n (x_i - \bar x)^2}\]
\[\widehat \beta_0 = \bar y - \widehat
\beta_1 \bar x \] Of course, this can be performed in
R
.
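As a quick check, these formulas can be computed by hand on the skincancer data loaded above; the values should match the lm() output shown below. A minimal sketch:
# compute the least squares estimates directly from the formulas
x = skincancer$Lat
y = skincancer$Mort
b1 = sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b0 = mean(y) - b1 * mean(x)
c(b0, b1)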
The lm()
function is already included in the base
R
. Hence, you can directly fit the model.
# fitting a linear regression
fit = lm(Mort ~ Lat, data = skincancer)
Please be aware of the syntax here:
- The data = argument points to the data set, which contains properly named columns. In this case, the columns are Mort and Lat. To check the column names of a data set, you may use the colnames() function.
- The ~ sign separates the covariate on the right-hand side from the outcome variable on the left-hand side.
Now we can check the model fitting:
summary(fit)
##
## Call:
## lm(formula = Mort ~ Lat, data = skincancer)
##
## Residuals:
## Min 1Q Median 3Q Max
## -38.972 -13.185 0.972 12.006 43.938
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 389.1894 23.8123 16.34 < 2e-16 ***
## Lat -5.9776 0.5984 -9.99 3.31e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 19.12 on 47 degrees of freedom
## Multiple R-squared: 0.6798, Adjusted R-squared: 0.673
## F-statistic: 99.8 on 1 and 47 DF, p-value: 3.309e-13
The summary provides much useful information:
- Call restates the model specification.
- Residuals is a summary of all the \(e_i = y_i - \hat\beta_0 - \hat\beta_1 x_i\) terms. We usually do not use them except for model diagnostics.
- Coefficients contains all estimated parameters and their corresponding p-values (in the last column, Pr(>|t|)) for testing whether each one is zero (the null hypothesis) or not (the alternative). For example, Lat (latitude) is a highly significant variable. Each unit increase of Lat is associated with a change in Mort of about -5.98, i.e., the mortality rate decreases by roughly 6 deaths per 10 million people.
- Residual standard error is another summary of the residuals. Here the degrees of freedom refers to the estimation of the variance of the residuals. We have seen this in the \(t\)-test previously.
- Multiple R-squared measures the proportion of the variance in the response variable \(Y\) that can be explained by the predictor (\(X\)) in this regression model. It is also the squared version of a quantity called Multiple R, which is the Pearson correlation between the observed \(Y\) values and the fitted values \(\hat\beta_0 + \hat\beta_1 x_i\). This quantity ranges between 0 and 1, with smaller values indicating a very poor fit and 1 indicating a perfect fit.
- Adjusted R-squared is a modified version of Multiple R-squared, since the latter will only increase as the number of variables in the model increases, regardless of whether they are useful or not. The Adjusted R-squared is defined as \(1 - \frac{(1-R^2)(n-1)}{n-p-1}\), which takes the number of predictors (\(p\)) into account. We will see more of this in the multiple linear regression part.
- F-statistic and the associated p-value refer to an overall test of this model. This means that we are jointly testing whether all of the coefficients are zero. Note that this quantity is usually highly significant because it combines the effects of all variables. This is related to ANOVA (analysis of variance), which can be used to jointly test whether several variables are significant overall. However, that is beyond the scope of this class.
To obtain more detailed results, you can further extract quantities from the fitted R object (fit). This is done by adding a $ after the object name. The following code shows some of them.
# the fitted values
fit$fitted.values[1:5]
## 1 2 3 4 5
## 191.9274 182.9609 179.9721 165.0280 156.0616
# the regression coefficients
fit$coefficients
## (Intercept) Lat
## 389.189351 -5.977636
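A few other useful quantities can be obtained in a similar way or with helper functions; for example (a sketch):
# residuals of the first five observations
fit$residuals[1:5]
# confidence intervals of the regression coefficients
confint(fit)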
With the estimated coefficients, it is very easy to predict the outcome for a new
subject. For example, if we are interested in a location with
Lat \(= 40\), then we would
simply compute
\[389.189351 - 5.977636 \times 40 = 150.0839,\] which predicts that the mortality rate is 150.0839 per
10 million people. We can also perform this prediction using pre-defined
R functions. However, this requires setting up a new data
frame with the same structure as the training data:
testdata = data.frame("Lat" = 40)
testdata
predict(fit, newdata = testdata)
## 1
## 150.0839
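The predict() function can also return interval estimates through its interval argument; for example:
# confidence interval for the mean response, and prediction interval for a new observation
predict(fit, newdata = testdata, interval = "confidence")
predict(fit, newdata = testdata, interval = "prediction")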
Multiple linear regression is essentially adding more predictors. A mathematical representation is
\[Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p + \epsilon.\]
And the idea of the optimization is similar: we want to minimize the squared errors given a set of observed data,
\[\underset{\beta_0, \beta_1, \ldots, \beta_p}{\arg\min} \frac{1}{n} \sum_{i=1}^n \big( y_i - \beta_0 - \beta_1 x_{i1} - \beta_2 x_{i2} - \cdots - \beta_p x_{ip}\big)^2\]
It is not straightforward to solve for these parameters by hand when given a set of data. The solution involves some linear algebra:
\[\widehat{\boldsymbol \beta}= (\mathbf{X}^\text{T} \mathbf{X})^{-1} \mathbf{X}^\text{T}\mathbf{y}\]
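As a sketch, this formula can be verified numerically on the skin cancer data (using latitude and longitude as covariates); the result should match the coefficients from the lm() fit shown below.
# solve for the coefficients using the closed-form matrix solution
X = model.matrix(~ Lat + Long, data = skincancer)
y = skincancer$Mort
solve(t(X) %*% X, t(X) %*% y)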
This can be done using the lm()
function. For example,
if we want to include the longitude in the skin cancer data, we can
specify two variables using the +
sign in the right hand
side of ~
:
skincancer = read.table("skin.txt", header = TRUE)
# fitting a multiple linear regression
fit = lm(Mort~ Lat + Long, data = skincancer)
summary(fit)
##
## Call:
## lm(formula = Mort ~ Lat + Long, data = skincancer)
##
## Residuals:
## Min 1Q Median 3Q Max
## -36.551 -13.525 -0.757 14.055 41.700
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 400.6755 28.0512 14.284 < 2e-16 ***
## Lat -5.9308 0.6038 -9.822 7.18e-13 ***
## Long -0.1467 0.1873 -0.783 0.438
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 19.19 on 46 degrees of freedom
## Multiple R-squared: 0.684, Adjusted R-squared: 0.6703
## F-statistic: 49.79 on 2 and 46 DF, p-value: 3.101e-12
Most of the quantities in this output have similar meanings as in a simple linear regression, except that we now have an additional variable, longitude. However, longitude is not a significant variable, and in most cases you could remove it from the model.
Predictions can also be made using the predict()
function by specifying one or more data points. You can also do this by
hand by simply plugging the estimated coefficients into the linear
equation.
testdata = data.frame("Lat" = c(40, 45), "Long" = c(80, 100))
testdata
# predict this pre-specified testing data
predict(fit, testdata)
## 1 2
## 151.7097 119.1224
# predict the state of Illinois, which is the 12th row
predict(fit, skincancer[12, ])
## 12
## 150.3165
The Ocean
variable is a binary variable that indicates whether a state
is next to an ocean. We often call this type of variable (with two
categories) a dummy variable, and treat it as a factor in
R. The following code specifies a model with a categorical
variable with two levels.
fit = lm(Mort~ Lat + Long + as.factor(Ocean), data = skincancer)
summary(fit)
##
## Call:
## lm(formula = Mort ~ Lat + Long + as.factor(Ocean), data = skincancer)
##
## Residuals:
## Min 1Q Median 3Q Max
## -30.171 -10.070 -2.607 8.781 41.805
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 349.2369 27.0596 12.906 < 2e-16 ***
## Lat -5.4950 0.5289 -10.390 1.55e-13 ***
## Long 0.1219 0.1732 0.704 0.485245
## as.factor(Ocean)1 21.7976 5.2263 4.171 0.000137 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 16.48 on 45 degrees of freedom
## Multiple R-squared: 0.7721, Adjusted R-squared: 0.7569
## F-statistic: 50.83 on 3 and 45 DF, p-value: 1.7e-14
You may notice that in the output there is a new parameter estimate,
as.factor(Ocean)1. This parameter estimates
the effect of Ocean \(= 1\). You may wonder where the effect
of category 0 is. Since this variable has only two categories, we only need
one parameter to describe its effect: when Ocean
\(= 0\), we simply remove the
contribution of this parameter. You can validate this using the
following testing data:
testdata = data.frame("Lat" = c(40, 40),
"Long" = c(100, 100),
"Ocean" = c(0, 1))
testdata
# predict and compare the two results
predict(fit, testdata)
## 1 2
## 141.6275 163.4251
Sometimes a linear relationship is not sufficient to model the relationship, and we may want to include higher-order terms and also interactions to explain the effect. For example, we could add the squared term of latitude and the interaction between latitude and longitude to the model.
fit = lm(Mort~ Lat + Long + as.factor(Ocean) + I(Lat^2) + I(Lat*Long), data = skincancer)
summary(fit)
##
## Call:
## lm(formula = Mort ~ Lat + Long + as.factor(Ocean) + I(Lat^2) +
## I(Lat * Long), data = skincancer)
##
## Residuals:
## Min 1Q Median 3Q Max
## -28.055 -11.770 -2.746 10.003 38.654
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 167.29775 192.50086 0.869 0.390
## Lat 5.55593 8.01734 0.693 0.492
## Long -0.57828 1.87076 -0.309 0.759
## as.factor(Ocean)1 24.39569 5.55126 4.395 7.14e-05 ***
## I(Lat^2) -0.16444 0.11192 -1.469 0.149
## I(Lat * Long) 0.01888 0.04533 0.416 0.679
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 16.44 on 43 degrees of freedom
## Multiple R-squared: 0.7832, Adjusted R-squared: 0.758
## F-statistic: 31.07 on 5 and 43 DF, p-value: 3.059e-13
The prediction can be done using the same dataset we specified before:
predict(fit, testdata)
## 1 2
## 144.1132 168.5089
Do you notice the change of significance? What do you think is the cause? Which variable(s) would you like to remove/keep?
Collinearity refers to the situation where the covariates are (almost) linearly dependent. This is a very severe problem for linear regression. Let’s look at an example where we have \(Y = X_1 + X_2 + \epsilon\), and the variables \(X_1\) and \(X_2\) are highly correlated with each other. The true parameters are 1 and 1, respectively. Since we know the true relationship, both variables should be highly significant. However, this is not the case when we fit the regression model. Moreover, the parameter estimates become 6.36779 and -4.34668, which is awfully wrong. However, their sum is almost 2, and this model explains more than 75% of the variation. Can you think of why?
set.seed(1)
x1 = rnorm(100)
x2 = x1 + rnorm(100, sd = 0.01)
y = x1 + x2 + rnorm(100)
fit = lm(y~ x1 + x2)
summary(fit)
##
## Call:
## lm(formula = y ~ x1 + x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.94359 -0.43645 0.00202 0.63692 2.63941
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.02535 0.10519 0.241 0.810
## x1 6.36779 10.94836 0.582 0.562
## x2 -4.34668 10.94785 -0.397 0.692
##
## Residual standard error: 1.043 on 97 degrees of freedom
## Multiple R-squared: 0.7556, Adjusted R-squared: 0.7505
## F-statistic: 149.9 on 2 and 97 DF, p-value: < 2.2e-16
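A simple diagnostic is to check the correlation among the covariates themselves; a more formal tool is the variance inflation factor, e.g., car::vif(fit), assuming the car package is installed. A minimal sketch:
# the two covariates are almost perfectly correlated
cor(x1, x2)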
On the other hand, if we create independent covariates with exactly the same model, we see the expected results.
set.seed(1)
x1 = rnorm(100)
x2 = rnorm(100)
y = x1 + x2 + rnorm(100)
fit = lm(y~ x1 + x2)
summary(fit)
##
## Call:
## lm(formula = y ~ x1 + x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.94359 -0.43645 0.00202 0.63692 2.63941
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.02535 0.10519 0.241 0.81
## x1 1.02111 0.11675 8.746 6.81e-14 ***
## x2 0.94653 0.10948 8.646 1.12e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.043 on 97 degrees of freedom
## Multiple R-squared: 0.609, Adjusted R-squared: 0.601
## F-statistic: 75.55 on 2 and 97 DF, p-value: < 2.2e-16
In practice, you may want to eliminate variables with “almost the same definition”, which can be highly correlated. However, this is inevitable if we have a large number of variables, i.e., when \(p\) is large. Eventually, when \(p\) is close to or larger than the sample size \(n\), an ordinary linear regression model should not be used. There are several ways to handle this situation; for example, a simple strategy called step-wise regression could be used, which adds or removes variables from the model one at a time based on their p-values. When \(p\) is too large, you may consider Ridge or Lasso regression.
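For illustration, base R provides the step() function, which automates this type of selection (using AIC rather than p-values); a minimal sketch on the skin cancer model:
# AIC-based stepwise selection starting from the full model
full = lm(Mort ~ Lat + Long + as.factor(Ocean), data = skincancer)
step(full, direction = "both")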
Let’s imagine a relationship in which \(X\) (e.g., a genetic factor) causes both \(Z\) (a symptom) and \(Y\) (another symptom). This can be demonstrated using the following code:
set.seed(1)
X = rnorm(100)
Z = X + rnorm(100, sd = 0.5)
Y = X + rnorm(100, sd = 0.5)
If we simply regress \(Y\) on \(Z\), we may conclude that \(Z\) would increase \(Y\), as \(Z\) is highly significant. However, this is not really the true relationship.
summary(lm(Y~Z))$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.05396496 0.06910713 0.7808884 4.367504e-01
## Z 0.77524559 0.06799321 11.4018082 1.135413e-19
Instead, we should also control for the effect of \(X\); once we do, \(Z\) is no longer significant. This means that, given the information of \(X\), \(Z\) does not affect \(Y\).
summary(lm(Y~Z+X))$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.01267671 0.05259704 0.2410157 8.100516e-01
## Z -0.05346682 0.10947855 -0.4883771 6.263850e-01
## X 1.06402190 0.12401872 8.5795268 1.551643e-13
In this example, \(X\) is a confounding variable. Not including appropriate confounding variables can often lead to wrong conclusions. Commonly used confounding variables are age, gender, and race. However, you should always decide which ones to include based on the specific study. Here are some things to consider: