Basic Concepts

Suppose we collect a set of observations with design matrix \(\mathbf{X}\) and outcome vector \(\mathbf{y}\). Linear regression estimates the coefficients by solving

\[ \widehat{\boldsymbol \beta} = \underset{\boldsymbol \beta}{\arg\min} \big( \mathbf y - \mathbf{X} \boldsymbol \beta \big)^\text{T} \big( \mathbf y - \mathbf{X} \boldsymbol \beta \big) \] This can be viewed either as a convex optimization problem or as a projection in the \(n\)-dimensional vector space.
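
Since the objective function is convex in \(\boldsymbol \beta\), setting its gradient to zero gives the normal equations \[ -2 \mathbf{X}^\text{T} \big( \mathbf y - \mathbf{X} \boldsymbol \beta \big) = \mathbf{0} \quad \Longrightarrow \quad \mathbf{X}^\text{T} \mathbf{X} \boldsymbol \beta = \mathbf{X}^\text{T} \mathbf{y}, \] and, when \(\mathbf{X}^\text{T} \mathbf{X}\) is invertible, the closed-form solution \(\widehat{\boldsymbol \beta} = (\mathbf{X}^\text{T} \mathbf{X})^{-1} \mathbf{X}^\text{T} \mathbf{y}\).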

Linear regression as an optimization problem

    # generate data for a simple linear regression 
    set.seed(20)
    n = 100
    x <- cbind(1, rnorm(n))
    y <- x %*% c(1, 0.5) + rnorm(n)
    
    # calculate the residual sum of squares for a grid of beta values
    rss <- function(b, x, y) sum((y - x %*% b)^2)
    b1 <- b2 <- seq(0, 2, length = 20)
    z = matrix(apply(expand.grid(b1, b2), 1, rss, x, y), 20, 20)
    
    # 3d plot for RSS
    par(mar = c(1,1,3,1))
    persp(b1, b2, z, xlab = "beta 1", ylab = "beta 2", zlab = "RSS",
          main="Residual Sum of Squares", col = "springgreen", shade = 0.6,
          theta = 30, phi = 5)

    # The solution can be found by any general-purpose optimization algorithm 
    optim(c(0, 0), rss, x = x, y = y)$par
## [1] 1.088813 0.679870
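
Since optim() relies on a generic numerical method (Nelder-Mead by default), its solution is only approximate. As a quick check, we can compare it with the closed-form solution from the normal equations, reusing the x and y generated above:

    # exact least squares solution via the normal equations
    solve(t(x) %*% x, t(x) %*% y)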

Linear regression as projections

Another view is through projection in vector space. Consider each column of \(\mathbf{X}\) as a vector, and project \(\mathbf{y}\) onto the column space of \(\mathbf{X}\). The projection is

\[ \widehat{\mathbf{y}} = \mathbf{X} (\mathbf{X}^\text{T} \mathbf{X})^{-1}\mathbf{X}^\text{T} \mathbf{y} \doteq {\mathbf{H}} \mathbf{y}, \] where \(\mathbf{H}\) is a projection matrix, often called the hat matrix. The residuals are then simply

\[ \widehat{\mathbf{e}} = \mathbf{y} - \widehat{\mathbf{y}} = (\mathbf{I} - \mathbf{H}) \mathbf{y} \] When the number of variables is large, inverting \(\mathbf{X}^\text{T} \mathbf{X}\) is expensive. The R function lm() does not calculate the inverse directly; instead, a QR decomposition can be used. You can try a larger \(n\) and \(p\) to see a significant difference. These decompositions are only for demonstration and are not required for our course.
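
Before moving on to those decompositions, here is a minimal sketch verifying the projection view itself, reusing the small x and y simulated in the optimization example above: computing \(\mathbf{H} \mathbf{y}\) directly should reproduce the fitted values from lm().

    # construct the hat matrix and project y onto the column space of x
    H = x %*% solve(t(x) %*% x) %*% t(x)
    yhat = H %*% y
    
    # compare with the fitted values from lm(); the difference should be near zero
    max(abs(yhat - fitted(lm(y ~ x[, 2]))))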

    # generate 1000 observations with 500 variables
    set.seed(1)
    n = 1000
    p = 500
    x = matrix(rnorm(n*p), n, p)
    X = cbind(1, x) # the design matrix, including 1 as the first column
    
    # define the true beta, the first entry is the intercept
    b = as.matrix(c(1, 1, 0.5, rep(0, p-2))) 
    
    # generate training y with Gaussian errors
    y = X %*% b + rnorm(n)
    
    # fit a linear regression model 
    lm.fit = lm(y ~ x)
    
    # look at the coefficients beta hat
    head(lm.fit$coef)
##  (Intercept)           x1           x2           x3           x4           x5 
##  1.016479750  1.026143517  0.496792668 -0.017272409  0.005193304  0.034639107
    # using normal equations by inverting the X'X matrix: b = (X'X)^-1 X'y 
    # however, this is very slow
    # check ?solve
    system.time({beta_hat = solve(t(X) %*% X) %*% t(X) %*% y})
##    user  system elapsed 
##    0.35    0.00    0.35
    head(beta_hat)
##              [,1]
## [1,]  1.016479750
## [2,]  1.026143517
## [3,]  0.496792668
## [4,] -0.017272409
## [5,]  0.005193304
## [6,]  0.034639107
    # you can avoid the inversion by specifying the linear equation system X'X b = X'y 
    system.time({beta_hat = solve(t(X) %*% X, t(X) %*% y)})    
##    user  system elapsed 
##    0.15    0.00    0.15
    # A better approach is to use the QR decomposition or the Cholesky decomposition 
    # The following code is not necessarily efficient; it is only for demonstration purposes
    
    # QR decomposition
    # directly calling the qr.coef function
    system.time({beta_hat = qr.coef(qr(X), y)})
##    user  system elapsed 
##    0.14    0.00    0.14
    # or 
    system.time({beta_hat = qr.solve(t(X) %*% X, t(X) %*% y)})
##    user  system elapsed 
##    0.17    0.00    0.17
    # if you want to see what Q and R are
    QR = qr(X)
    Q = qr.Q(QR)
    R = qr.R(QR)
    
    # get the inverse of R (a check that R %*% R_inv returns the identity appears below)
    # the backsolve/forwardsolve functions solve AX = b for an upper/lower triangular matrix A 
    # ?backsolve
    R_inv = backsolve(R, diag(p+1), upper.tri = TRUE, transpose = FALSE)
    beta_hat = R_inv %*% t(Q) %*% y
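
    # as suggested above, confirm that R_inv is indeed the inverse of R;
    # the largest entry-wise deviation from the identity should be near zero
    max(abs(R %*% R_inv - diag(p+1)))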
    
    # Cholesky Decomposition 
    
    # the chol function gives upper triangular matrix
    # crossprod(X) = X'X
    system.time({
    R = chol(crossprod(X))
    w = backsolve(R, t(X) %*% y, upper.tri = TRUE, transpose = TRUE)
    beta_hat = backsolve(R, w, upper.tri = TRUE, transpose = FALSE)
    })
##    user  system elapsed 
##    0.12    0.00    0.12
    # or equivalently 
    R = t(chol(crossprod(X)))
    w = forwardsolve(R, t(X) %*% y, upper.tri = FALSE, transpose = FALSE)
    beta_hat = forwardsolve(R, w, upper.tri = FALSE, transpose = TRUE) # the transpose = TRUE means that we are solving for R'b = w instead of Rb = w 
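
All of these approaches solve the same normal equations, so the estimates should agree up to numerical precision. A minimal check against the lm() fit obtained earlier:

    # the decomposition-based solution should match the lm() coefficients
    max(abs(beta_hat - coef(lm.fit)))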

Model Selection: A Motivating Simulation

Is including all variables in a linear regression always a good choice? Based on our Mallows's \(C_p\) analysis, it is not. We can validate this with a simulation study. In this example, we generate data from a model whose variables have different effect sizes. Would it be beneficial to remove some of them?

  set.seed(1)
  n = 100
  p = 20
  
  nsim = 100
  allerrors = matrix(NA, nsim, p)
  
  for (i in 1:nsim)
  {
      # the design matrix
      x = matrix(rnorm(n*p), n, p)
      ytrain = x %*% 0.4^sqrt(c(1:p)) + rnorm(n)
      ytest = x %*% 0.4^sqrt(c(1:p)) + rnorm(n)
      
      # try all different dimensions
      for (j in 1:p)
      {
          # construct the data
          traindata = data.frame("x" = x[, 1:j, drop = FALSE], "y" = ytrain)
          testdata = data.frame("x" = x[, 1:j, drop = FALSE], "y" = ytest)
          
          # fit model and calculate testing error
          onefit = lm(y~., data = traindata)
          ypred = predict(onefit, testdata)
          error = mean( (ypred - testdata$y)^2 )
          
          allerrors[i, j] = error                  
      }
  }
  
  plot(colMeans(allerrors), type = "b", 
       col = "darkorange", lwd = 3, 
       xlab = "Number of Variables", 
       ylab = "Prediction Error")

The result shows that if we keep adding variables with small effects, the prediction error eventually grows. How, then, should we select variables?

Model Selection Criteria and Algorithms

Example: diabetes dataset

We use the diabetes dataset from the lars package as a demonstration of model selection.

    library(lars)
    data(diabetes)
    diab = data.frame(cbind(diabetes$x, "Y" = diabetes$y))
    
    # A Brief Description of the Diabetes Data (Efron et al., 2004):
    # Ten baseline variables: age, sex, body mass index, average blood pressure, and six blood serum
    # measurements were obtained for each of n = 442 diabetes patients, as well as
    # the response of interest, a quantitative measure of disease progression one year after baseline 
    
    lmfit=lm(Y~., data=diab)
    
    # When we use a normal likelihood for the errors, there are 12 parameters:
    # 11 regression coefficients plus the error variance
    # The function AIC() directly calculates the AIC score from a lm() fitted model 
    n = nrow(diab)
    p = 11

    # ?AIC
    AIC(lmfit) # a built-in function for calculating AIC using the -2 log-likelihood
## [1] 4795.985
    n*log(sum(residuals(lmfit)^2/n)) + n + n*log(2*pi) + 2 + 2*p
## [1] 4795.985
    # In many standard R packages, the AIC is calculated by removing some constants from the likelihood 
    # We will use this value as the default
    ?extractAIC
    extractAIC(lmfit) # AIC for the full model
## [1]   11.000 3539.643
    RSS = sum(residuals(lmfit)^2)
    n*log(RSS/n) + 2*p
## [1] 3539.643
    # so the BIC for the full model is 
    extractAIC(lmfit, k = log(n))
## [1]   11.000 3584.648
    n*log(RSS/n) + log(n)*p
## [1] 3584.648
    # if we want to calculate Cp, use the formula
    RSS + 2*p*summary(lmfit)$sigma^2
## [1] 1328502
    # however, the scale of this is usually very large; we may consider the following version
    RSS/summary(lmfit)$sigma^2 + 2*p - n
## [1] 11
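
As a sanity check on the two AIC versions above, their difference should equal the constant dropped by extractAIC(), namely \(n + n\log(2\pi) + 2\) (the extra 2 accounting for the \(\sigma^2\) parameter):

    # AIC() and extractAIC() differ by a model-independent constant
    AIC(lmfit) - extractAIC(lmfit)[2]
## [1] 1256.342
    n + n*log(2*pi) + 2
## [1] 1256.342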

The step() function can be used to select the best model based on a specified model selection criterion.

    # Model selection: stepwise algorithm 
    # ?step
    
    # this function shows every step during the model selection 
    step(lmfit, direction="both", k = 2)    # k = 2 (AIC) is default; 
## Start:  AIC=3539.64
## Y ~ age + sex + bmi + map + tc + ldl + hdl + tch + ltg + glu
## 
##        Df Sum of Sq     RSS    AIC
## - age   1        82 1264066 3537.7
## - hdl   1       663 1264646 3537.9
## - glu   1      3080 1267064 3538.7
## - tch   1      3526 1267509 3538.9
## <none>              1263983 3539.6
## - ldl   1      5799 1269782 3539.7
## - tc    1     10600 1274583 3541.3
## - sex   1     45000 1308983 3553.1
## - ltg   1     56015 1319998 3556.8
## - map   1     72103 1336086 3562.2
## - bmi   1    179028 1443011 3596.2
## 
## Step:  AIC=3537.67
## Y ~ sex + bmi + map + tc + ldl + hdl + tch + ltg + glu
## 
##        Df Sum of Sq     RSS    AIC
## - hdl   1       646 1264712 3535.9
## - glu   1      3001 1267067 3536.7
## - tch   1      3543 1267608 3536.9
## <none>              1264066 3537.7
## - ldl   1      5751 1269817 3537.7
## - tc    1     10569 1274635 3539.4
## + age   1        82 1263983 3539.6
## - sex   1     45831 1309896 3551.4
## - ltg   1     55963 1320029 3554.8
## - map   1     73850 1337915 3560.8
## - bmi   1    179079 1443144 3594.2
## 
## Step:  AIC=3535.9
## Y ~ sex + bmi + map + tc + ldl + tch + ltg + glu
## 
##        Df Sum of Sq     RSS    AIC
## - glu   1      3093 1267805 3535.0
## - tch   1      3247 1267959 3535.0
## <none>              1264712 3535.9
## - ldl   1      7505 1272217 3536.5
## + hdl   1       646 1264066 3537.7
## + age   1        66 1264646 3537.9
## - tc    1     26840 1291552 3543.2
## - sex   1     46382 1311094 3549.8
## - map   1     73536 1338248 3558.9
## - ltg   1     97509 1362221 3566.7
## - bmi   1    178537 1443249 3592.3
## 
## Step:  AIC=3534.98
## Y ~ sex + bmi + map + tc + ldl + tch + ltg
## 
##        Df Sum of Sq     RSS    AIC
## - tch   1      3686 1271491 3534.3
## <none>              1267805 3535.0
## - ldl   1      7472 1275277 3535.6
## + glu   1      3093 1264712 3535.9
## + hdl   1       738 1267067 3536.7
## + age   1         0 1267805 3537.0
## - tc    1     26378 1294183 3542.1
## - sex   1     44686 1312491 3548.3
## - map   1     82154 1349959 3560.7
## - ltg   1    102520 1370325 3567.3
## - bmi   1    189970 1457775 3594.7
## 
## Step:  AIC=3534.26
## Y ~ sex + bmi + map + tc + ldl + ltg
## 
##        Df Sum of Sq     RSS    AIC
## <none>              1271491 3534.3
## + tch   1      3686 1267805 3535.0
## + glu   1      3532 1267959 3535.0
## + hdl   1       395 1271097 3536.1
## + age   1        11 1271480 3536.3
## - ldl   1     39378 1310869 3545.7
## - sex   1     41858 1313349 3546.6
## - tc    1     65237 1336728 3554.4
## - map   1     79627 1351119 3559.1
## - bmi   1    190586 1462077 3594.0
## - ltg   1    294094 1565585 3624.2
## 
## Call:
## lm(formula = Y ~ sex + bmi + map + tc + ldl + ltg, data = diab)
## 
## Coefficients:
## (Intercept)          sex          bmi          map           tc          ldl          ltg  
##       152.1       -226.5        529.9        327.2       -757.9        538.6        804.2
    step(lmfit, direction="backward", trace=0) # trace=0 will not print intermediate results
## 
## Call:
## lm(formula = Y ~ sex + bmi + map + tc + ldl + ltg, data = diab)
## 
## Coefficients:
## (Intercept)          sex          bmi          map           tc          ldl          ltg  
##       152.1       -226.5        529.9        327.2       -757.9        538.6        804.2
    step(lm(Y~1, data=diab), scope=list(upper=lmfit, lower=~1), direction="forward", trace=0)
## 
## Call:
## lm(formula = Y ~ bmi + ltg + map + tc + sex + ldl, data = diab)
## 
## Coefficients:
## (Intercept)          bmi          ltg          map           tc          sex          ldl  
##       152.1        529.9        804.2        327.2       -757.9       -226.5        538.6
    step(lmfit, direction="both", k=log(n), trace=0)  # BIC (the default value for k=2, which corresponds to AIC)
## 
## Call:
## lm(formula = Y ~ sex + bmi + map + tc + ldl + ltg, data = diab)
## 
## Coefficients:
## (Intercept)          sex          bmi          map           tc          ldl          ltg  
##       152.1       -226.5        529.9        327.2       -757.9        538.6        804.2
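
Note that step() returns a fitted lm object, so the selected model can be used directly for summaries and prediction; for example:

    # the result of step() is an ordinary lm fit
    best.fit = step(lmfit, direction = "both", trace = 0)
    summary(best.fit)$coefficients  # coefficient table of the selected model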

The leaps package calculates the best model of each model size. We can then add the penalty terms to these fits and determine the overall best model.

    ##########################################################################
    # Best subset model selection (Cp, AIC, and BIC): leaps 
    ##########################################################################
    library(leaps)
    
    # performs an exhaustive search over models and returns the best model 
    # (lowest RSS) of each size
    # the default maximum model size is nvmax = 8
    
    RSSleaps=regsubsets(as.matrix(diab[,-11]),diab[,11])
    summary(RSSleaps, matrix=T)
## Subset selection object
## 10 Variables  (and intercept)
##     Forced in Forced out
## age     FALSE      FALSE
## sex     FALSE      FALSE
## bmi     FALSE      FALSE
## map     FALSE      FALSE
## tc      FALSE      FALSE
## ldl     FALSE      FALSE
## hdl     FALSE      FALSE
## tch     FALSE      FALSE
## ltg     FALSE      FALSE
## glu     FALSE      FALSE
## 1 subsets of each size up to 8
## Selection Algorithm: exhaustive
##          age sex bmi map tc  ldl hdl tch ltg glu
## 1  ( 1 ) " " " " "*" " " " " " " " " " " " " " "
## 2  ( 1 ) " " " " "*" " " " " " " " " " " "*" " "
## 3  ( 1 ) " " " " "*" "*" " " " " " " " " "*" " "
## 4  ( 1 ) " " " " "*" "*" "*" " " " " " " "*" " "
## 5  ( 1 ) " " "*" "*" "*" " " " " "*" " " "*" " "
## 6  ( 1 ) " " "*" "*" "*" "*" "*" " " " " "*" " "
## 7  ( 1 ) " " "*" "*" "*" "*" "*" " " "*" "*" " "
## 8  ( 1 ) " " "*" "*" "*" "*" "*" " " "*" "*" "*"
    RSSleaps=regsubsets(as.matrix(diab[,-11]),diab[,11], nvmax=10)
    summary(RSSleaps,matrix=T)
## Subset selection object
## 10 Variables  (and intercept)
##     Forced in Forced out
## age     FALSE      FALSE
## sex     FALSE      FALSE
## bmi     FALSE      FALSE
## map     FALSE      FALSE
## tc      FALSE      FALSE
## ldl     FALSE      FALSE
## hdl     FALSE      FALSE
## tch     FALSE      FALSE
## ltg     FALSE      FALSE
## glu     FALSE      FALSE
## 1 subsets of each size up to 10
## Selection Algorithm: exhaustive
##           age sex bmi map tc  ldl hdl tch ltg glu
## 1  ( 1 )  " " " " "*" " " " " " " " " " " " " " "
## 2  ( 1 )  " " " " "*" " " " " " " " " " " "*" " "
## 3  ( 1 )  " " " " "*" "*" " " " " " " " " "*" " "
## 4  ( 1 )  " " " " "*" "*" "*" " " " " " " "*" " "
## 5  ( 1 )  " " "*" "*" "*" " " " " "*" " " "*" " "
## 6  ( 1 )  " " "*" "*" "*" "*" "*" " " " " "*" " "
## 7  ( 1 )  " " "*" "*" "*" "*" "*" " " "*" "*" " "
## 8  ( 1 )  " " "*" "*" "*" "*" "*" " " "*" "*" "*"
## 9  ( 1 )  " " "*" "*" "*" "*" "*" "*" "*" "*" "*"
## 10  ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" "*"
    sumleaps=summary(RSSleaps,matrix=T)
    names(sumleaps)  # components returned by summary(RSSleaps)
## [1] "which"  "rsq"    "rss"    "adjr2"  "cp"     "bic"    "outmat" "obj"
    sumleaps$which
##    (Intercept)   age   sex  bmi   map    tc   ldl   hdl   tch   ltg   glu
## 1         TRUE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 2         TRUE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE
## 3         TRUE FALSE FALSE TRUE  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE
## 4         TRUE FALSE FALSE TRUE  TRUE  TRUE FALSE FALSE FALSE  TRUE FALSE
## 5         TRUE FALSE  TRUE TRUE  TRUE FALSE FALSE  TRUE FALSE  TRUE FALSE
## 6         TRUE FALSE  TRUE TRUE  TRUE  TRUE  TRUE FALSE FALSE  TRUE FALSE
## 7         TRUE FALSE  TRUE TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE FALSE
## 8         TRUE FALSE  TRUE TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE
## 9         TRUE FALSE  TRUE TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## 10        TRUE  TRUE  TRUE TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
    msize=apply(sumleaps$which,1,sum)
    n=dim(diab)[1]
    p=dim(diab)[2]
    Cp = sumleaps$rss/(summary(lmfit)$sigma^2) + 2*msize - n;
    AIC = n*log(sumleaps$rss/n) + 2*msize;
    BIC = n*log(sumleaps$rss/n) + msize*log(n);
    
    cbind(Cp, sumleaps$cp)
##            Cp           
## 1  148.352561 148.352561
## 2   47.072229  47.072229
## 3   30.663634  30.663634
## 4   21.998461  21.998461
## 5    9.148045   9.148045
## 6    5.560162   5.560162
## 7    6.303221   6.303221
## 8    7.248522   7.248522
## 9    9.028080   9.028080
## 10  11.000000  11.000000
    cbind(BIC, sumleaps$bic)  # It seems regsubsets uses a formula for BIC different from the one we used. 
##         BIC          
## 1  3665.879 -174.1108
## 2  3586.331 -253.6592
## 3  3575.249 -264.7407
## 4  3571.077 -268.9126
## 5  3562.469 -277.5210
## 6  3562.900 -277.0899
## 7  3567.708 -272.2819
## 8  3572.720 -267.2702
## 9  3578.585 -261.4049
## 10 3584.648 -255.3424
    BIC - sumleaps$bic  # but the two differ only by a constant, so the model selection result is unaffected
##       1       2       3       4       5       6       7       8       9      10 
## 3839.99 3839.99 3839.99 3839.99 3839.99 3839.99 3839.99 3839.99 3839.99 3839.99
    n*log(sum((diab[,11] - mean(diab[,11]))^2/n)) # the difference is the score of the intercept-only model
## [1] 3839.99
    # Rescale Cp, AIC, and BIC to [0, 1] so they can be plotted together.
    inrange <- function(x) { (x - min(x)) / (max(x) - min(x)) }
    
    Cp = sumleaps$cp; Cp = inrange(Cp);
    BIC = sumleaps$bic; BIC = inrange(BIC);
    AIC = n*log(sumleaps$rss/n) + 2*msize; AIC = inrange(AIC);
    
    plot(range(msize), c(0, 1.1), type="n", 
         xlab="Model Size (with Intercept)", ylab="Model Selection Criteria")
    points(msize, Cp, col="red", type="b")
    points(msize, AIC, col="blue", type="b")
    points(msize, BIC, col="black", type="b")
    legend("topright", lty=rep(1,3), col=c("red", "blue", "black"), legend=c("Cp", "AIC", "BIC"))