Now, let’s discuss several specific methods. One of the oldest one is Newton’s method. This is motivated form a quadratic approximation (essentially second order Taylor expansion) at a current point \(\mathbf{x}\),
\[f(\mathbf{x}^\ast) \approx f(\mathbf{x}) + \nabla f(\mathbf{x})^\text{T}(\mathbf{x}^\ast - \mathbf{x}) + \frac{1}{2} (\mathbf{x}^\ast - \mathbf{x})^\text{T}\mathbf{H}(\mathbf{x}) (\mathbf{x}^\ast - \mathbf{x})\] Our goal is to find a new stationary point \(\mathbf{x}^\ast\) such that \(\nabla f(\mathbf{x}^\ast) = 0\). By taking derivative of the above equation on both sides, with respect to \(\mathbf{x}^\ast\), we need
\[0 = \nabla f(\mathbf{x}^\ast) = 0 + \nabla f(\mathbf{x}) + \mathbf{H}(\mathbf{x}) (\mathbf{x}^\ast - \mathbf{x})\] which leads to
\[\mathbf{x}^\ast = \mathbf{x}- \mathbf{H}(\mathbf{x})^{-1} \nabla f(\mathbf{x}).\]
Hence, if we are currently at a point \(\mathbf{x}^{(k)}\), we need to calculate the gradient \(\nabla f(\mathbf{x}^{(k)})\) and Hessian \(\mathbf{H}(\mathbf{x})\) at this point, then move to the new point using \(\mathbf{x}^{(k+1)} = \mathbf{x}^{(k)} - \mathbf{H}(\mathbf{x}^{(k)})^{-1} \nabla f(\mathbf{x}^{(k)})\). Some properties and things to concern regarding Newton’s method:
Since the idea of Newton’s method is to solve a vector \(\mathbf{v}\) such that
\[\mathbf{H}(\mathbf{x}^{(k)}) \mathbf{v}=
- \nabla f(\mathbf{x}^{(k)}), \] If \(\mathbf{H}\) is difficult to compute, we
may use some matrix to substitute it. For example, if we simplify use
the identity matrix \(\mathbf{I}\),
then this reduces to a first-order method to be introduced later.
However, if we can get a good approximation, we can still solve this
linear system and get to a better point. Then the question is how to
obtain such matrix in a computationally inexpensive way. The
Broyden–Fletcher–Goldfarb–Shanno (BFSG) algorithm is such an approach by
iteratively updating its (inverse) estimation. For details, please see
the SMLR
book. We have already used the BFGS method previously in the
optim()
example.
When simply using \(\mathbf{H}= \mathbf{I}\), we update \[\mathbf{x}^{(k+1)} = \mathbf{x}^{(k)} - \delta \nabla f(\mathbf{x}^{(k)}).\] However, it is then crucial to figure out the step size \(\delta\). A step size too large may not even converge at all, however, a step size too small will take too many iterations to converge. Alternatively, line search could be used.
We use linear regression as an example. The objective function for linear regression is:
\[ \ell(\boldsymbol \beta) = \frac{1}{2n}||\mathbf{y} - \mathbf{X} \boldsymbol \beta ||^2 \] with solution
\[\widehat{\boldsymbol \beta} = \left(\mathbf{X}^\text{T}\mathbf{X}\right)^{-1} \mathbf{X}^\text{T} \mathbf{y} \]
par(bg="transparent")
library(MASS)
set.seed(3)
n = 200
# create some data with linear model
X = mvrnorm(n, c(0, 0), matrix(c(1,0.7, 0.7, 1), 2,2))
y = rnorm(n, mean = 2*X[,1] + X[,2])
beta1 <- seq(-1, 4, 0.005)
beta2 <- seq(-1, 4, 0.005)
allbeta <- data.matrix(expand.grid(beta1, beta2))
rss <- matrix(apply(allbeta, 1, function(b, X, y) sum((y - X %*% b)^2), X, y),
length(beta1), length(beta2))
# quantile levels for drawing contour
quanlvl = c(0.01, 0.025, 0.05, 0.2, 0.5, 0.75)
# plot the contour
contour(beta1, beta2, rss, levels = quantile(rss, quanlvl))
box()
# the truth
b = solve(t(X) %*% X) %*% t(X) %*% y
points(b[1], b[2], pch = 19, col = "blue", cex = 2)
This is a contour plot, with its lowest point at the linear regression parameter estimate \(\widehat{\boldsymbol \beta}\). We use an optimization approach to solve this problem. By taking the derivative with respect to \(\boldsymbol \beta\), we have the gradient
\[ \begin{align} \frac{\partial \ell(\boldsymbol \beta)}{\partial \boldsymbol \beta} = -\frac{1}{n} \sum_{i=1}^n (y_i - x_i^\text{T} \boldsymbol \beta) x_i. \end{align} \] To perform the optimization, we will first set an initial beta value, say \(\boldsymbol \beta = \mathbf{0}\) for all entries, then proceed with the update
\[\begin{align} \boldsymbol \beta^\text{new} =& \boldsymbol \beta^\text{old} - \frac{\partial \ell(\boldsymbol \beta)}{\partial \boldsymbol \beta} \times \delta.\\ =&\boldsymbol \beta^\text{old} + \delta \frac{1}{n} \sum_{i=1}^n (y_i - x_i^\text{T} \boldsymbol \beta) x_i.\\ \end{align}\]
Let’s set \(\delta = 0.2\) for now. The following function performs gradient descent.
# gradient descent function, which also record the path
mylm_g <- function(x, y,
b0 = rep(0, ncol(x)), # initial value
delta = 0.2, # step size
epsilon = 1e-6, #stopping rule
maxitr = 5000) # maximum iterations
{
if (!is.matrix(x)) stop("x must be a matrix")
if (!is.vector(y)) stop("y must be a vector")
if (nrow(x) != length(y)) stop("number of observations different")
# initialize beta values
allb = matrix(b0, 1, length(b0))
# iterative update
for (k in 1:maxitr)
{
# the new beta value
b1 = b0 + t(x) %*% (y - x %*% b0) * delta / length(y)
# record the new beta
allb = rbind(allb, as.vector(b1))
# stopping rule
if (max(abs(b0 - b1)) < epsilon)
break;
# reset beta0
b0 = b1
}
if (k == maxitr) cat("maximum iteration reached\n")
return(list("allb" = allb, "beta" = b1))
}
# fit the model
mybeta = mylm_g(X, y, b0 = c(0, 1))
par(bg="transparent")
contour(beta1, beta2, rss, levels = quantile(rss, quanlvl))
points(mybeta$allb[,1], mybeta$allb[,2], type = "b", col = "red", pch = 19)
points(b[1], b[2], pch = 19, col = "blue", cex = 1.5)
box()
The descent path is very smooth because we choose a very small step size. However, if the step size is too large, we may observe unstable results or even unable to converge. For example, if We set \(\delta = 1\) or \(\delta = 1.5\).
par(mfrow=c(1,2))
par(mar=c(2,2,2,2))
par(bg="transparent")
# fit the model with a larger step size
mybeta = mylm_g(X, y, b0 = c(0, 1), delta = 1)
contour(beta1, beta2, rss, levels = quantile(rss, quanlvl))
points(mybeta$allb[,1], mybeta$allb[,2], type = "b", col = "red", pch = 19)
points(b[1], b[2], pch = 19, col = "blue", cex = 1.5)
box()
# and even larger
mybeta = mylm_g(X, y, b0 = c(0, 1), delta = 1.5, maxitr = 6)
## maximum iteration reached
contour(beta1, beta2, rss, levels = quantile(rss, quanlvl))
points(mybeta$allb[,1], mybeta$allb[,2], type = "b", col = "red", pch = 19)
points(b[1], b[2], pch = 19, col = "blue", cex = 1.5)
box()
optim()
function, the third timeWhen we only supply the optim()
function the objective
function definition fn =
, it will internally numerically
approximate the gradient. Sometimes this is not preferred due to
computational reasons. And when you do have the theoretical gradient, it
is likely to be more accurate than numerically approximated values.
Hence, as long as the gradient itself does not cost too much compute, we
may take advantage of it. We could supply the optim()
function using the gradient gr =
. This is the example we
used previously:
# generate data from a simple linear model
set.seed(20)
n = 150
X <- cbind(1, rnorm(n))
y <- X %*% c(0.5, 1) + rnorm(n)
Please note that the calculation of the gradient is in a matrix form, note that \(\mathbf{X}\) is the \(n \times p\) design matrix. This should look familiar to you since it is the normal equation used in solving the linear regression analytic solution.
\[ -\frac{1}{n} \sum_{i=1}^n (y_i - x_i^\text{T} \boldsymbol \beta) x_i = - \frac{1}{n} \mathbf{X}^\text{T}(\mathbf{y}- \mathbf{X}\boldsymbol \beta) \]
# calculate the residual sum of squares for a grid of beta values
rss <- function(b, trainx, trainy) mean((trainy - trainx %*% b)^2)
# the gradient formula is already provided
rss_g <- function(b, trainx, trainy) -2*t(trainx) %*% (trainy - trainx %*% b) / nrow(trainx)
# The solution can be solved by any optimization algorithm
lm.optim.g <- optim(par = c(2, 2), fn = rss, gr = rss_g, method = "BFGS", trainx = X, trainy = y)
lm.optim.g
## $par
## [1] 0.566921 1.016770
##
## $value
## [1] 1.076857
##
## $counts
## function gradient
## 8 5
##
## $convergence
## [1] 0
##
## $message
## NULL
When fitting linear regressions, outliers could significantly affect the fitting results. However, manually checking and removing outliers can be tricky and time consuming. Some regression methods address this problem by using a more robust loss function. For example, the Huber loss. The idea is to minimize
\[ \frac{1}{n} \sum_{i=1}^n \ell_\delta(y_i - x_i^\text{T} \boldsymbol \beta) \] where the Huber loss function \(\ell_\delta\) is defined as
\[ \ell_\delta( a ) = \begin{cases} \frac{1}{2} a^2 & \quad \text{if } |a| \leq \delta \\ \delta(|a| - \frac{1}{2} \delta) & \quad \text{o.w.} \end{cases} \]
Huber <- function(a, delta = 1) ifelse(abs(a) <= delta, 0.5*a^2, delta*( abs(a) - 0.5*delta))
x = seq(-4, 4, 0.001)
plot(x, Huber(x), type = "l",
xlab = "a", ylab = "Huber Loss",
col = "darkorange", ylim = c(0, 8))
lines(x, 0.5*x^2, col = "deepskyblue")
Huber_f <- function(b, trainx, trainy) mean(Huber(trainy - trainx %*% b, delta = 1))
optim(par = c(2, 2), fn = Huber_f, method = "BFGS", trainx = X, trainy = y)
# generate data from a simple linear model
set.seed(20)
n = 150
X <- cbind(1, rnorm(n))
y <- X %*% c(0.5, 1) + rnorm(n)
y[7] = -30
optim(par = c(2, 2), fn = Huber_f, method = "BFGS", trainx = X, trainy = y)
lm(y ~ X - 1)
Constrained problems are even more commonly seen in this course, such as Lasso, Ridge and SVM. However, with suitable conditions, they can be transform to a easier, non-constrained problem using Lagrangian. An example is provided at here.