Chapter 4 Linear Algebra Basics

You should already be familiar with some basic linear algebra concepts such as matrix and vector multiplications. Here we review some basic concepts and properties that will be used in this course. For the most part, they are used in deriving linear regression results.

4.1 Definition

We usually use \(\mathbf{X}\) to denote an \(n \times p\) dimensional design matrix, where \(n\) is the number of observations and \(p\) is the number of variables. The columns of \(\mathbf{X}\) are denoted as \(\mathbf{x}_1, \ldots, \mathbf{x}_p\):

\[ \mathbf{X}= \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1p}\\ x_{21} & x_{22} & \cdots & x_{2p}\\ \vdots & \vdots & \ddots & \vdots\\ x_{n1} & x_{n2} & \cdots & x_{np}\\ \end{pmatrix} = \begin{pmatrix} \mathbf{x}_1 & \mathbf{x}_2 & \cdots & \mathbf{x}_p\\ \end{pmatrix} \]

The column space \(\cal{C}(\mathbf{X})\) of \(\mathbf{X}\) is the set of all linear combinations of \(\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_p\), i.e., all vectors of the form

\[c_1 \mathbf{x}_1 + c_2 \mathbf{x}_2 + \cdots + c_p \mathbf{x}_p, \quad c_1, \ldots, c_p \in \mathbb{R}.\]

This is also called the span of these vectors, \(\text{span}(\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_p)\). Its orthogonal complement is

\[\{\mathbf{v}: \mathbf{X}^\text{T}\mathbf{v}= 0\}.\]
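As a concrete illustration, here is a minimal sketch in Python with NumPy (the small design matrix is made up for this example, not taken from the text). It shows that \(\mathbf{X}\mathbf{c}\) is a linear combination of the columns and that a vector satisfying \(\mathbf{X}^\text{T}\mathbf{v}= 0\) lies in the orthogonal complement.

```python
import numpy as np

# A hypothetical 3 x 2 design matrix (n = 3 observations, p = 2 variables).
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])

# Any linear combination c1*x1 + c2*x2 lies in the column space C(X).
c = np.array([2.0, -1.0])
print(X @ c)              # identical to 2*X[:, 0] - 1*X[:, 1]

# v = (1, -2, 1) satisfies X^T v = 0, so it lies in the orthogonal complement.
v = np.array([1.0, -2.0, 1.0])
print(X.T @ v)            # [0. 0.]
```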

4.2 Linear Regression

In many cases, we will be using linear regression as an example. This concerns regressing a vector of outcomes \(\mathbf{y}\) onto the column space of \(\mathbf{X}\). In other words, we look for a set of coefficients \((c_1, \ldots, c_p)\) such that the Euclidean distance between \(c_1 \mathbf{x}_1 + c_2 \mathbf{x}_2 + \cdots + c_p \mathbf{x}_p\) and \(\mathbf{y}\) is minimized. One way to view this problem is to project the vector \(\mathbf{y}\) onto the space \(\cal{C}(\mathbf{X})\). This can be done through a linear operator, called the projection matrix \(\mathbf{P}\), defined as

\[ \mathbf{X}(\mathbf{X}^\text{T}\mathbf{X})^{-1} \mathbf{X}^\text{T}, \]

provided that \(\mathbf{X}\) has full column rank. Hence, the projection can be obtained as

\[ \mathbf{P}\mathbf{y}= \mathbf{X}(\mathbf{X}^\text{T}\mathbf{X})^{-1} \mathbf{X}^\text{T}\mathbf{y}. \]

Notice that \(\mathbf{P}\mathbf{y}\) is a linear combination of the columns of \(\mathbf{X}\), since \((\mathbf{X}^\text{T}\mathbf{X})^{-1} \mathbf{X}^\text{T}\mathbf{y}\) is a \(p\)-vector, namely the coefficient vector \((c_1, \ldots, c_p)^\text{T}\). If you are familiar with linear regression, \((\mathbf{X}^\text{T}\mathbf{X})^{-1} \mathbf{X}^\text{T}\mathbf{y}\) is the least squares solution, \(\mathbf{P}\mathbf{y}\) is the vector of fitted values, and \(\mathbf{y}- \mathbf{P}\mathbf{y}\) is the residual vector.
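The following sketch (Python/NumPy, with made-up data rather than anything from the text) illustrates this projection view: the fitted values equal \(\mathbf{X}\) times the coefficient vector, and the residuals are orthogonal to \(\cal{C}(\mathbf{X})\).

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 20, 3
X = rng.normal(size=(n, p))                 # assumed to have full column rank
y = rng.normal(size=n)

# Projection matrix onto C(X).
P = X @ np.linalg.inv(X.T @ X) @ X.T

c_hat = np.linalg.solve(X.T @ X, X.T @ y)   # (X^T X)^{-1} X^T y
y_hat = P @ y                               # fitted values, an element of C(X)
resid = y - y_hat                           # residual vector

print(np.allclose(y_hat, X @ c_hat))        # True: P y = X c_hat
print(np.allclose(X.T @ resid, 0))          # True: residuals orthogonal to C(X)
```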

There are some useful properties of a projection matrix \(\mathbf{P}\):

  • Idempotent: A projection matrix is idempotent, which means that multiplying it by itself leaves it unchanged: \(\mathbf{P}^2 = \mathbf{P}\)
  • Symmetry (for Orthogonal Projections): If the projection is orthogonal, the matrix is symmetric: \(\mathbf{P}= \mathbf{P}^T\)
  • Spectrum: The eigenvalues of a projection matrix are either 0 or 1. The number of eigenvalues equal to 1 is the rank of the matrix.
  • Rank and Nullity: If \(\mathbf{P}\) is a projection onto a subspace of dimension \(k\) of an \(n\)-dimensional space, then the rank of \(\mathbf{P}\) is \(k\) and the nullity (the dimension of the null space) is \(n - k\).
  • Orthogonal Complement: If \(\mathbf{P}\) is the projection onto a subspace \(U\), then \((I - \mathbf{P})\) is the projection onto the orthogonal complement of \(U\).

These properties make projection matrices useful in various applications, including least squares regression, signal processing, and many areas of machine learning and statistics.
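As a quick sanity check of these properties, here is a short Python/NumPy sketch using a randomly generated \(\mathbf{X}\) (chosen only for illustration).

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 10, 4
X = rng.normal(size=(n, k))                  # full column rank with probability one
P = X @ np.linalg.inv(X.T @ X) @ X.T         # orthogonal projection onto C(X)

print(np.allclose(P @ P, P))                 # idempotent: P^2 = P
print(np.allclose(P, P.T))                   # symmetric

eig = np.sort(np.linalg.eigvalsh(P))         # eigenvalues are all 0 or 1
print(np.allclose(eig[:n - k], 0) and np.allclose(eig[n - k:], 1))

print(np.linalg.matrix_rank(P) == k)         # rank equals the subspace dimension k
Q = np.eye(n) - P                            # projection onto the orthogonal complement
print(np.allclose(P @ Q, 0))                 # the two projections are orthogonal
```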

4.3 Matrix Inversion

4.3.1 Rank-one Update

The Sherman-Morrison formula is a helpful result in linear algebra, especially in the context of statistical computations. It provides an expression for the inverse of a rank-one perturbation of a given invertible matrix:

\[ (\mathbf{A}+ \mathbf{u}\mathbf{v}^T)^{-1} = \mathbf{A}^{-1} - \frac{\mathbf{A}^{-1}\mathbf{u}\mathbf{v}^T \mathbf{A}^{-1}}{1 + \mathbf{v}^T \mathbf{A}^{-1}\mathbf{u}} \]

Here, \(\mathbf{A}\) is an \(n \times n\) invertible matrix, and \(\mathbf{u}\) and \(\mathbf{v}\) are \(n \times 1\) vectors. The formula is valid provided the denominator \(1 + \mathbf{v}^T \mathbf{A}^{-1}\mathbf{u}\) is nonzero, and it has many applications in statistical computations, optimization, and other areas.
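A quick numerical check of the formula (Python/NumPy, with made-up \(\mathbf{A}\), \(\mathbf{u}\), and \(\mathbf{v}\)):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5
A = rng.normal(size=(n, n)) + n * np.eye(n)   # a well-conditioned invertible matrix
u = rng.normal(size=(n, 1))
v = rng.normal(size=(n, 1))

A_inv = np.linalg.inv(A)
denom = 1.0 + (v.T @ A_inv @ u).item()        # must be nonzero for the formula to hold
sm_inv = A_inv - (A_inv @ u @ v.T @ A_inv) / denom

print(np.allclose(sm_inv, np.linalg.inv(A + u @ v.T)))   # True
```

Note that once \(\mathbf{A}^{-1}\) is available, the update requires only matrix-vector products and an outer product, which is what makes it attractive for repeated rank-one updates.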

4.3.2 Rank-\(k\) Update

The Woodbury Matrix Identity is another powerful result used in statistical computations and relates to the inversion of a rank-k correction of a matrix:

\[ (\mathbf{A}+ \mathbf{U}\mathbf{C}\mathbf{V}^T)^{-1} = \mathbf{A}^{-1} - \mathbf{A}^{-1}\mathbf{U}(\mathbf{C}^{-1} + \mathbf{V}^T\mathbf{A}^{-1}\mathbf{U})^{-1}\mathbf{V}^T\mathbf{A}^{-1} \]

This can be seen as an extension of the Sherman-Morrison formula: here \(\mathbf{U}\) and \(\mathbf{V}\) are \(n \times k\) matrices and \(\mathbf{C}\) is an invertible \(k \times k\) matrix. It is particularly useful when dealing with high-dimensional data or large-scale computations, since only \(k \times k\) systems need to be inverted once \(\mathbf{A}^{-1}\) is available.
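The sketch below (Python/NumPy, with hypothetical dimensions \(n = 8\), \(k = 3\)) verifies the identity numerically.

```python
import numpy as np

rng = np.random.default_rng(4)
n, k = 8, 3
A = rng.normal(size=(n, n)) + n * np.eye(n)   # n x n, invertible
U = rng.normal(size=(n, k))
C = rng.normal(size=(k, k)) + k * np.eye(k)   # k x k, invertible
V = rng.normal(size=(n, k))

A_inv = np.linalg.inv(A)
middle = np.linalg.inv(np.linalg.inv(C) + V.T @ A_inv @ U)   # only k x k inverses here
wb_inv = A_inv - A_inv @ U @ middle @ V.T @ A_inv

print(np.allclose(wb_inv, np.linalg.inv(A + U @ C @ V.T)))   # True
```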

4.3.3 2 \(\times\) 2 Block Matrix Inversion

A general \(2 \times 2\) block matrix can be represented as:

\[ \mathbf{M}= \begin{bmatrix} \mathbf{A}& \mathbf{B}\\ \mathbf{C}& \mathbf{D}\end{bmatrix} \]

If the matrix is invertible, its inverse can be expressed as:

\[ \mathbf{M}^{-1} = \begin{bmatrix} \mathbf{A}^{-1} + \mathbf{A}^{-1}\mathbf{B}(\mathbf{D}- \mathbf{C}\mathbf{A}^{-1}\mathbf{B})^{-1}\mathbf{C}\mathbf{A}^{-1} & -\mathbf{A}^{-1}\mathbf{B}(\mathbf{D}- \mathbf{C}\mathbf{A}^{-1}\mathbf{B})^{-1} \\ -(\mathbf{D}- \mathbf{C}\mathbf{A}^{-1}\mathbf{B})^{-1}\mathbf{C}\mathbf{A}^{-1} & (\mathbf{D}- \mathbf{C}\mathbf{A}^{-1}\mathbf{B})^{-1} \end{bmatrix} \]

This formula assumes that the required inverses exist; in particular, \(\mathbf{A}\) and the Schur complement \(\mathbf{D}- \mathbf{C}\mathbf{A}^{-1}\mathbf{B}\) must be invertible.
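A numerical check of the block inverse (Python/NumPy, with made-up blocks shifted so that \(\mathbf{A}\) and the Schur complement are invertible):

```python
import numpy as np

rng = np.random.default_rng(5)
p, q = 4, 3
A = rng.normal(size=(p, p)) + p * np.eye(p)
B = rng.normal(size=(p, q))
C = rng.normal(size=(q, p))
D = rng.normal(size=(q, q)) + q * np.eye(q)

A_inv = np.linalg.inv(A)
S_inv = np.linalg.inv(D - C @ A_inv @ B)       # inverse of the Schur complement of A

M = np.block([[A, B], [C, D]])
M_inv = np.block([
    [A_inv + A_inv @ B @ S_inv @ C @ A_inv, -A_inv @ B @ S_inv],
    [-S_inv @ C @ A_inv,                     S_inv],
])

print(np.allclose(M_inv, np.linalg.inv(M)))    # True
```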

4.4 Matrix Norms

Here are several commonly used matrix norms for an \(m \times n\) matrix \(\mathbf{A}\):

  • Frobenius Norm: \[ \| \mathbf{A}\|_F = \sqrt{\sum_{i=1}^{m}\sum_{j=1}^{n} |a_{ij}|^2} \]
  • 1-Norm (Column Sum Norm): \[ \| \mathbf{A}\|_1 = \max_{1 \leq j \leq n} \sum_{i=1}^{m} |a_{ij}| \]
  • Infinity Norm (Row Sum Norm): \[ \| \mathbf{A}\|_\infty = \max_{1 \leq i \leq m} \sum_{j=1}^{n} |a_{ij}| \]
  • 2-Norm (Spectral Norm): \[ \| \mathbf{A}\|_2 = \sigma_{\text{max}}(\mathbf{A}) = \sqrt{\lambda_{\text{max}}(\mathbf{A}^T \mathbf{A})} \]
  • Schatten \(p\)-Norm (in terms of the singular values \(\sigma_i\) of \(\mathbf{A}\)): \[ \| \mathbf{A}\|_p = \left( \sum_{i=1}^{\min(m,n)} \sigma_i^p \right)^{1/p} \]
  • Operator Norm (induced by a vector norm \(\|\cdot\|\); with the Euclidean norm this is the 2-norm above): \[ \| \mathbf{A}\| = \max_{\mathbf{x}\neq \mathbf{0}} \frac{\| \mathbf{A}\mathbf{x}\|}{\| \mathbf{x}\|} \]
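These norms are all available in NumPy; the short sketch below computes them for a small made-up matrix (the Schatten \(p\)-norm is formed directly from the singular values, since it is not a built-in option).

```python
import numpy as np

A = np.array([[1.0, -2.0,  3.0],
              [0.0,  4.0, -1.0]])

fro = np.linalg.norm(A, 'fro')                 # Frobenius norm
one = np.linalg.norm(A, 1)                     # maximum column sum
inf = np.linalg.norm(A, np.inf)                # maximum row sum
two = np.linalg.norm(A, 2)                     # largest singular value

sigma = np.linalg.svd(A, compute_uv=False)     # singular values of A
p = 3
schatten_p = (sigma ** p).sum() ** (1 / p)     # Schatten p-norm

print(fro, one, inf, two, schatten_p)
print(np.isclose(two, sigma.max()))            # spectral norm equals sigma_max
```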

And some of their relationships, for an \(m \times n\) matrix \(\mathbf{A}\):

  • Relationships among the 1-Norm, 2-Norm, and Infinity Norm \[ \| \mathbf{A}\|_2 \leq \sqrt{\| \mathbf{A}\|_1 \| \mathbf{A}\|_\infty} \] \[ \frac{1}{\sqrt{n}} \| \mathbf{A}\|_\infty \leq \| \mathbf{A}\|_2 \leq \sqrt{m} \| \mathbf{A}\|_\infty \] \[ \frac{1}{\sqrt{m}} \| \mathbf{A}\|_1 \leq \| \mathbf{A}\|_2 \leq \sqrt{n} \| \mathbf{A}\|_1 \]

  • Relationship between Frobenius Norm and 2-Norm \[ \| \mathbf{A}\|_2 \leq \| \mathbf{A}\|_F \leq \sqrt{\min\{m,n\}} \| \mathbf{A}\|_2 \]
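A quick numerical check of these inequalities (Python/NumPy, with a random \(m \times n\) matrix chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(6)
m, n = 5, 7
A = rng.normal(size=(m, n))

one = np.linalg.norm(A, 1)
two = np.linalg.norm(A, 2)
inf = np.linalg.norm(A, np.inf)
fro = np.linalg.norm(A, 'fro')

print(two <= np.sqrt(one * inf))                     # ||A||_2 <= sqrt(||A||_1 ||A||_inf)
print(inf / np.sqrt(n) <= two <= np.sqrt(m) * inf)   # relates 2-norm and infinity-norm
print(one / np.sqrt(m) <= two <= np.sqrt(n) * one)   # relates 2-norm and 1-norm
print(two <= fro <= np.sqrt(min(m, n)) * two)        # relates 2-norm and Frobenius norm
```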