Chapter 19 Boosting
Boosting is another ensemble model, created in the form of
F_T(x) = \sum_{t=1}^T \alpha_t f_t(x)
However, it is different from random forest, in which each f_t(x) is learned in parallel. These f_t(x)'s are called weak learners and are constructed sequentially, with coefficients \alpha_t representing their weights. The most classical model, AdaBoost, was proposed by Freund and Schapire (1997) for classification problems, and a more statistical view of this type of model, called gradient boosting machines (J. H. Friedman 2001), can handle any loss function we commonly use. We will first introduce AdaBoost and then discuss gradient boosting.
19.1 AdaBoost
Following our common notation, we observe a set of data \{x_i, y_i\}_{i=1}^n. Similar to SVM, we code the y_i's as -1 or 1. AdaBoost builds F_T(x) sequentially and uses \text{sign}(F_T(x)) as the classification rule. The algorithm is given below, followed by a small R sketch of the loop:
- Initiate weights w_i^{(1)} = 1/n, for i = 1, \ldots, n
- For t = 1, \ldots, T, do
- Fit a classifier f_t(x) to the training data with subject weights w_i^{(t)}'s.
- Compute the weighted error rate \epsilon_t = \sum_{i=1}^n w_i^{(t)} \mathbf{1}\{y_i \neq f_t(x_i)\}
- Compute \alpha_t = \frac{1}{2} \log \frac{1 - \epsilon_t}{\epsilon_t}
- Update subject weights w_i^{(t+1)} = \frac{1}{Z_t} w_i^{(t)} \exp\{-\alpha_t y_i f_t(x_i)\}, where Z_t is a normalizing constant that makes the w_i^{(t+1)}'s sum to 1: Z_t = \sum_{i=1}^n w_i^{(t)} \exp\{-\alpha_t y_i f_t(x_i)\}
- Output the final model F_T(x) = \sum_{t=1}^T \alpha_t f_t(x); the decision rule is \text{sign}(F_T(x)).
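To make the loop concrete, here is a minimal sketch of AdaBoost in R, using single-split rpart trees (stumps) as the weak learners. The function names adaboost_stumps and predict_adaboost, and all settings, are our own illustrative choices rather than part of any package; a production version would also guard against a weighted error of exactly 0 or 1/2.
library(rpart)
# a minimal AdaBoost loop with rpart stumps as weak learners
adaboost_stumps <- function(X, y, T = 20)
{
  n = nrow(X)
  w = rep(1/n, n)                       # initial weights 1/n
  learners = vector("list", T)
  alpha = numeric(T)
  dat = data.frame(X, y = factor(y))    # y coded as -1 / 1
  for (t in 1:T)
  {
    # fit a single-split tree with the current subject weights
    fit = rpart(y ~ ., data = dat, weights = w, method = "class",
                control = rpart.control(maxdepth = 1, minsplit = 2, cp = 0))
    pred = ifelse(predict(fit, dat)[, "1"] > 0.5, 1, -1)
    eps = sum(w * (pred != y))            # weighted error rate
    alpha[t] = 0.5 * log((1 - eps)/eps)   # learner weight (assumes 0 < eps < 1)
    # up-weight mis-classified subjects, then normalize
    w = w * exp(- alpha[t] * y * pred)
    w = w / sum(w)
    learners[[t]] = fit
  }
  list(learners = learners, alpha = alpha)
}
# decision rule: sign of the weighted sum of the weak learners
predict_adaboost <- function(model, Xnew)
{
  Fx = rep(0, nrow(Xnew))
  for (t in seq_along(model$alpha))
  {
    p = predict(model$learners[[t]], data.frame(Xnew))[, "1"]
    Fx = Fx + model$alpha[t] * ifelse(p > 0.5, 1, -1)
  }
  sign(Fx)
}
This sketch can be tried on the small two-dimensional toy data constructed below.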
An important mechanism in AdaBoost is the weight update step. Notice that a weight is increased when \exp\{-\alpha_t y_i f_t(x_i)\} is larger than 1, which happens exactly when y_i f_t(x_i) is negative, i.e., when subject i is mis-classified by f_t at this iteration. Hence, during the next iteration t+1, the model f_{t+1} will be more likely to address this subject. Here, f_t can be any classification model; for example, we could use a tree model. The following figures demonstrate this idea of updating the weights and aggregating the learners.
x1 = seq(0.1, 1, 0.1)
x2 = c(0.5, 0.3, 0.1, 0.6, 0.7,
0.8, 0.5, 0.7, 0.8, 0.2)
# the data
y = c(1, 1, -1, -1, 1,
1, -1, 1, -1, -1)
X = cbind("x1" = x1, "x2" = x2)
xgrid = expand.grid("x1" = seq(0, 1.1, 0.01), "x2" = seq(0, 0.9, 0.01))
# plot data
plot(X[, 1], X[, 2], col = ifelse(y > 0, "deepskyblue", "darkorange"),
pch = ifelse(y > 0, 4, 1), xlim = c(0, 1.1), lwd = 3,
ylim = c(0, 0.9), cex = 3)
# fit gbm with 3 trees
library(gbm)
## Loaded gbm 2.2.2
## This version of gbm is no longer under development. Consider transitioning to gbm3, https://github.com/gbm-developers/gbm3
gbm.fit = gbm(y ~., data.frame(x1, x2, y= as.numeric(y == 1)),
distribution="adaboost", interaction.depth = 1,
n.minobsinnode = 1, n.trees = 3,
shrinkage = 1, bag.fraction = 1)
# you may peek into each tree
pretty.gbm.tree(gbm.fit, i.tree = 1)
## SplitVar SplitCodePred LeftNode RightNode MissingNode ErrorReduction Weight Prediction
## 0 0 0.25 1 2 3 2.5 10 0.00
## 1 -1 1.00 -1 -1 -1 0.0 2 1.00
## 2 -1 -0.25 -1 -1 -1 0.0 8 -0.25
## 3 -1 0.00 -1 -1 -1 0.0 10 0.00
# we can view the predicted decision rule
plot(X[, 1], X[, 2], col = ifelse(y > 0, "deepskyblue", "darkorange"),
pch = ifelse(y > 0, 4, 1), xlim = c(0, 1.1), lwd = 3,
ylim = c(0, 0.9), cex = 3)
pred = predict(gbm.fit, xgrid)
## Using 3 trees...
points(xgrid, col = ifelse(pred > 0, "deepskyblue", "darkorange"),
cex = 0.2)
Here is a rundown of the algorithm. We initialize all weights as 1/n and use trees with a single split (stumps) as weak learners. The first tree splits at X_1 = 0.25. After the first tree, we need to adjust the weights.
w1 = rep(1/10, 10)
f1 <- function(x) ifelse(x[, 1] < 0.25, 1, -1)
e1 = sum(w1*(f1(X) != y))
a1 = 0.5*log((1-e1)/e1)
w2 = w1*exp(- a1*y*f1(X))
w2 = w2/sum(w2)
# the first tree
plot(X[, 1], X[, 2], col = ifelse(y > 0, "deepskyblue", "darkorange"),
pch = ifelse(y > 0, 4, 1), xlim = c(0, 1.1), lwd = 3,
ylim = c(0, 0.9), cex = 3)
pred = f1(xgrid)
points(xgrid, col = ifelse(pred > 0, "deepskyblue", "darkorange"),
cex = 0.2)
# weights after the first tree
plot(X[, 1], X[, 2], col = ifelse(y > 0, "deepskyblue", "darkorange"),
pch = ifelse(y > 0, 4, 1), xlim = c(0, 1.1), lwd = 3,
ylim = c(0, 0.9), cex = 30*w2)
We can see that correctly classified observations have their weights decreased, while mis-classified ones have their weights increased.
f2 <- function(x) ifelse(x[, 2] > 0.65, 1, -1)
e2 = sum(w2*(f2(X) != y))
a2 = 0.5*log((1-e2)/e2)
w3 = w2*exp(- a2*y*f2(X))
w3 = w3/sum(w3)
# the second tree
plot(X[, 1], X[, 2], col = ifelse(y > 0, "deepskyblue", "darkorange"),
pch = ifelse(y > 0, 4, 1), xlim = c(0, 1.1), lwd = 3,
ylim = c(0, 0.9), cex = 30*w2)
pred = f2(xgrid)
points(xgrid, col = ifelse(pred > 0, "deepskyblue", "darkorange"),
cex = 0.2)
# weights after the second tree
plot(X[, 1], X[, 2], col = ifelse(y > 0, "deepskyblue", "darkorange"),
pch = ifelse(y > 0, 4, 1), xlim = c(0, 1.1), lwd = 3,
ylim = c(0, 0.9), cex = 30*w3)
We then fit the third tree in the same way. Combining all three trees and their coefficients, we have the final classifier
F_3(x) = \sum_{t=1}^3 \alpha_t f_t(x) = 0.4236 \cdot f_1(x) + 0.6496 \cdot f_2(x) + 0.9229 \cdot f_3(x)
f3 <- function(x) ifelse(x[, 1] < 0.85, 1, -1)
e3 = sum(w3*(f3(X) != y))
a3 = 0.5*log((1-e3)/e3)
# the third tree
plot(X[, 1], X[, 2], col = ifelse(y > 0, "deepskyblue", "darkorange"),
pch = ifelse(y > 0, 4, 1), xlim = c(0, 1.1), lwd = 3,
ylim = c(0, 0.9), cex = 30*w3)
pred = f3(xgrid)
points(xgrid, col = ifelse(pred > 0, "deepskyblue", "darkorange"),
cex = 0.2)
# the final decision rule
plot(X[, 1], X[, 2], col = ifelse(y > 0, "deepskyblue", "darkorange"),
pch = ifelse(y > 0, 4, 1), xlim = c(0, 1.1), lwd = 3,
ylim = c(0, 0.9), cex = 3)
pred = a1*f1(xgrid) + a2*f2(xgrid) + a3*f3(xgrid)
points(xgrid, col = ifelse(pred > 0, "deepskyblue", "darkorange"),
cex = 0.2)
abline(v = 0.25) # f1
abline(h = 0.65) # f2
abline(v = 0.85) # f3
19.2 Training Error of AdaBoost
An interesting property of the boosting algorithm is that, if at each iteration t we can find a classifier that performs better than random guessing, then the training error eventually converges to zero. This is shown by analyzing the weights after the last iteration T:
\begin{align*}
w_i^{(T+1)} &= \frac{1}{Z_T} w_i^{(T)} \exp\{-\alpha_T y_i f_T(x_i)\} \\
&= \frac{1}{Z_1 \cdots Z_T} w_i^{(1)} \prod_{t=1}^T \exp\{-\alpha_t y_i f_t(x_i)\} \\
&= \frac{1}{Z_1 \cdots Z_T} \frac{1}{n} \exp\Big\{-y_i \sum_{t=1}^T \alpha_t f_t(x_i)\Big\}
\end{align*}
Since \sum_{t=1}^T \alpha_t f_t(x_i) is just the model at the T-th iteration, we can write it as F_T(x_i). Noticing that the weights sum up to 1, we have
1 = \sum_{i=1}^n w_i^{(T+1)} = \frac{1}{Z_1 \cdots Z_T} \frac{1}{n} \sum_{i=1}^n \exp\{-y_i F_T(x_i)\}
and
Z_1 \cdots Z_T = \frac{1}{n} \sum_{i=1}^n \exp\{-y_i F_T(x_i)\}
The right-hand side is the average exponential loss after fitting the model. In fact, this quantity bounds the 0/1 loss from above, since the exponential loss is \exp[-y f(x)]:
- For correctly classified subjects, y f(x) > 0 and \exp[-y f(x)] > 0, while the 0/1 loss is 0
- For incorrectly classified subjects, y f(x) < 0, so the exponential loss is larger than 1, while the 0/1 loss is 1
This means that
Z_1 \cdots Z_T \geq \frac{1}{n} \sum_{i=1}^n \mathbf{1}\{y_i \neq \text{sign}(F_T(x_i))\}
Hence, if we want the final model to have a low training error, we need an upper bound on the Z_t's. Recall that Z_t is used to normalize the weights, so we have
Z_t = \sum_{i=1}^n w_i^{(t)} \exp[-\alpha_t y_i f_t(x_i)].
There are two cases at this iteration: y_i f_t(x_i) = 1 for correctly classified subjects, and y_i f_t(x_i) = -1 for the incorrect ones. By our definition, \epsilon_t = \sum_i w_i^{(t)} \mathbf{1}\{y_i \neq f_t(x_i)\} is the proportion of weights on the mis-classified samples. Hence,
\begin{align*}
Z_t &= \sum_{i=1}^n w_i^{(t)} \exp[-\alpha_t y_i f_t(x_i)] \\
&= \sum_{y_i = f_t(x_i)} w_i^{(t)} \exp[-\alpha_t] + \sum_{y_i \neq f_t(x_i)} w_i^{(t)} \exp[\alpha_t] \\
&= \exp[-\alpha_t] \sum_{y_i = f_t(x_i)} w_i^{(t)} + \exp[\alpha_t] \sum_{y_i \neq f_t(x_i)} w_i^{(t)}
\end{align*}
So we have
Z_t = (1 - \epsilon_t) \exp[-\alpha_t] + \epsilon_t \exp[\alpha_t].
To minimize the product of all the Z_t's, we can consider minimizing each of them. Viewing Z_t as a function of \alpha_t and taking the derivative with respect to \alpha_t, we have
-(1 - \epsilon_t) \exp[-\alpha_t] + \epsilon_t \exp[\alpha_t] = 0
and
\alpha_t = \frac{1}{2} \log \frac{1 - \epsilon_t}{\epsilon_t}.
Plugging this back into Z_t, we have
Z_t = (1 - \epsilon_t)\sqrt{\frac{\epsilon_t}{1 - \epsilon_t}} + \epsilon_t \sqrt{\frac{1 - \epsilon_t}{\epsilon_t}} = 2\sqrt{\epsilon_t (1 - \epsilon_t)}
Since \epsilon_t(1 - \epsilon_t) can attain a maximum of only 1/4, Z_t cannot exceed 1, and it is strictly smaller than 1 whenever \epsilon_t \neq 1/2. This drives the product Z_1 \cdots Z_T toward 0. To see this more precisely, define \gamma_t = \frac{1}{2} - \epsilon_t as the improvement over a random model (which has error 1/2); then
Z_t = 2\sqrt{\epsilon_t(1 - \epsilon_t)} = \sqrt{1 - 4\gamma_t^2} \leq \exp[-2\gamma_t^2]
The last inequality holds because, by Taylor expansion, \exp[-4\gamma_t^2] \geq 1 - 4\gamma_t^2, and taking square roots gives the bound above. We can now put all the Z_t's together:
\text{Training Error} = \frac{1}{n} \sum_{i=1}^n \mathbf{1}\{y_i \neq \text{sign}(F_T(x_i))\} \leq \frac{1}{n} \sum_{i=1}^n \exp[-y_i F_T(x_i)] = Z_1 \cdots Z_T \leq \exp\Big[-2 \sum_{t=1}^T \gamma_t^2\Big],
which converges to 0 as long as \sum_{t=1}^T \gamma_t^2 grows to infinity. Of course, in practice, it becomes increasingly difficult to find an f_t(x) that greatly reduces the weighted training error.
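As a quick numerical check, we can reuse the objects from the three-stump example in the previous section (w1, w2, w3, e1, e2, e3, a1, a2, a3, f1, f2, f3, X, and y) to verify that each Z_t equals 2\sqrt{\epsilon_t(1 - \epsilon_t)} and that the product of the Z_t's upper-bounds the training error of \text{sign}(F_3(x)).
# normalizing constants from the three stumps
Z1 = sum(w1 * exp(- a1 * y * f1(X)))
Z2 = sum(w2 * exp(- a2 * y * f2(X)))
Z3 = sum(w3 * exp(- a3 * y * f3(X)))
# each Z_t matches 2 * sqrt(eps_t * (1 - eps_t))
rbind(c(Z1, 2*sqrt(e1*(1 - e1))),
      c(Z2, 2*sqrt(e2*(1 - e2))),
      c(Z3, 2*sqrt(e3*(1 - e3))))
# training error of the combined classifier vs. the bound Z1*Z2*Z3
Fx = a1*f1(X) + a2*f2(X) + a3*f3(X)
c(mean(sign(Fx) != y), Z1*Z2*Z3)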
19.3 Tuning the Number of Trees
Although we can achieve a very low training classification error, this is subject to overfitting. The following code demonstrates what an overfitted model looks like.
# One-dimensional classification example
n = 1000; set.seed(1)
x = cbind(seq(0, 1, length.out = n), runif(n))
py = (sin(4*pi*x[, 1]) + 1)/2
y = rbinom(n, 1, py)
plot(x[, 1], y + runif(n, -0.05, 0.05), pch = 19, ylim = c(-0.05, 1.05), cex = 0.5,
col = ifelse(y==1,"darkorange", "deepskyblue"), xlab = "x", ylab = "P(Y=1 | X=x)")
points(x[, 1], py, type = "l", lwd = 3)
# fit AdaBoost with bootstrapping, I am using a large shrinkage factor
gbm.fit = gbm(y~., data.frame(x, y), distribution="adaboost", n.minobsinnode = 2,
n.trees=200, shrinkage = 1, bag.fraction=0.8, cv.folds = 10)
# plot the decision function (Fx, not sign(Fx))
size = c(1, 5, 10, 20, 50, 100)
par(mfrow = c(2, 3)) # arrange the six panels in a grid
for(i in 1:6)
{
par(mar=c(2,2,3,1))
plot(x[, 1], py, type = "l", lwd = 3, ylab = "P(Y=1 | X=x)", col = "gray")
points(x[, 1], y + runif(n, -0.05, 0.05), pch = 19, cex = 0.5, ylim =c(-0.05, 1.05),
col = ifelse(y==1, "darkorange", "deepskyblue"))
Fx = predict(gbm.fit, n.trees=size[i]) # this returns the fitted function, but not class
lines(x[, 1], 1/(1+exp(-2*Fx)), lwd = 1)
title(paste("# of Iterations = ", size[i]))
}
Hence, selecting the number of trees is necessary. For this purpose, we can either use the out-of-bag error to estimate the exponential-loss bound, or simply do cross-validation.
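The code chunk that produced the number below is not shown; a call along the following lines, using gbm.perf from the gbm package on the gbm.fit object fitted above with cv.folds = 10, would report the iteration with the smallest cross-validated error (method = "OOB" gives the out-of-bag choice instead).
# select the number of trees that minimizes the cross-validated error
gbm.perf(gbm.fit, method = "cv", plot.it = FALSE)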
## [1] 39
19.4 Gradient Boosting
Let's take an alternative view of this problem: we use an additive structure to fit models of the form
F_T(x) = \sum_{t = 1}^T \alpha_t f(x; \boldsymbol{\theta}_t)
by minimizing a loss function
\underset{\{\alpha_t, \boldsymbol{\theta}_t\}_{t=1}^T}{\min} \sum_{i=1}^n L\big(y_i, F_T(x_i)\big)
In this framework, we may choose a loss function L that is suitable for the problem, and also choose the base learner f(x; \boldsymbol{\theta}) with parameter \boldsymbol{\theta}. Examples include a linear function, a spline, a tree, etc. While it may be difficult to minimize over all parameters \{\alpha_t, \boldsymbol{\theta}_t\}_{t=1}^T jointly, we may consider doing this in a stage-wise fashion. The algorithm works in the following way:
- Set F_0(x) = 0
- For t = 1, \ldots, T
- Choose (\alpha_t, \boldsymbol{\theta}_t) to minimize the loss \underset{\alpha, \boldsymbol{\theta}}{\min} \,\, \sum_{i=1}^n L\big(y_i, F_{t-1}(x_i) + \alpha f(x_i; \boldsymbol{\theta})\big)
- Update F_t(x) = F_{t-1}(x) + \alpha_t f(x; \boldsymbol{\theta}_t)
- Output F_T(x) as the final model
The previous AdaBoost example uses the exponential loss function. Also, it does not pick an optimal f(x; \boldsymbol{\theta}) at each step; we just need a model that is better than random. The step size \alpha_t is then optimized at each t given the fitted f(x; \boldsymbol{\theta}_t).
Another example is the forward stage-wise linear regression. In this case, we fit a single variable linear model at each step t:
f(x; j) = \text{sign}\big(\text{Cor}(X_j, \mathbf{r})\big) X_j
- \mathbf{r} is the residual vector, with r_i = y_i - F_{t-1}(x_i)
- j is the index of the variable with the largest absolute correlation with \mathbf{r}
We then take a very small step size \alpha_t, say \alpha_t = 10^{-5}, in the direction given by the sign of the correlation between X_j and \mathbf{r}. In this case, the path of F_t(x), as t increases, is almost equivalent to the Lasso solution path.
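To make this concrete, below is a minimal sketch of forward stage-wise linear regression on simulated data; the data, step size, number of steps, and the names Xsim and ysim are our own illustrative choices. With a small enough step size, the coefficient path it traces out closely resembles the Lasso path.
# forward stage-wise linear regression on simulated data
set.seed(1)
n = 100; p = 5
Xsim = matrix(rnorm(n * p), n, p)
beta = c(2, -1.5, 0, 0, 1)
ysim = Xsim %*% beta + rnorm(n)
eps = 1e-3          # very small step size
nsteps = 5000
b = rep(0, p)       # current coefficients
r = ysim            # current residual
for (t in 1:nsteps)
{
  cors = cor(Xsim, r)                 # correlation of each variable with the residual
  j = which.max(abs(cors))            # variable with the largest absolute correlation
  b[j] = b[j] + eps * sign(cors[j])   # tiny step in the direction of that correlation
  r = ysim - Xsim %*% b               # update the residual
}
round(b, 2)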
We may notice that r_i is in fact the negative gradient of the (halved) squared-error loss, viewed as a function of the fitted value:
r_{it} = - \left[ \frac{\partial \, \frac{1}{2}\big(y_i - F(x_i)\big)^2 }{\partial \, F(x_i)} \right]_{F(x_i) = F_{t-1}(x_i)}
and we are essentially fitting a weak learner f_t(x) to the residuals and updating the fitted model F_t(x). The following example shows the result of using a tree learner as f_t(x):
library(gbm)
# a simple regression problem
p = 1
x = seq(0, 1, 0.001)
fx <- function(x) 2*sin(3*pi*x)
y = fx(x) + rnorm(length(x))
plot(x, y, pch = 19, ylab = "y", col = "gray", cex = 0.5)
# plot the true regression line
lines(x, fx(x), lwd = 2, col = "deepskyblue")
We can see that the fitted model progressively approximates the true function.
# fit regression boosting
# I use a very large shrinkage value for demonstrating the functions
# in practice you should use 0.1 or even smaller values for stability
gbm.fit = gbm(y~x, data = data.frame(x, y), distribution = "gaussian",
n.trees=300, shrinkage=0.5, bag.fraction=0.8)
# somehow, cross-validation for 1 dimensional problem creates error
# gbm(y ~ ., data = data.frame(x, y), cv.folds = 3) # this produces an error
# plot the fitted regression function at several iterations
par(mfrow=c(2,3))
size=c(1,5,10,50,100,300)
for(i in 1:6)
{
par(mar=c(2,2,3,1))
plot(x, y, pch = 19, ylab = "y", col = "gray", cex = 0.5)
lines(x, fx(x), lwd = 2, col = "deepskyblue")
# this returns the fitted function, but not class
Fx = predict(gbm.fit, n.trees=size[i])
lines(x, Fx, lwd = 3, col = "darkorange")
title(paste("# of Iterations = ", size[i]))
}
This idea can be generalized to any loss function L. This is the gradient boosting model:
- At each iteration t, calculate the "pseudo-residuals", i.e., the negative gradient for each observation: g_{it} = - \left[ \frac{\partial L(y_i, F(x_i))}{\partial F(x_i)} \right]_{F(x_i) = F_{t-1}(x_i)}
- Fit f_t(x; \boldsymbol{\theta}_t) to the pseudo-residuals g_{it}
- Search for the best \alpha_t = \underset{\alpha}{\arg\min} \sum_{i=1}^n L\big(y_i, F_{t-1}(x_i) + \alpha f(x_i; \boldsymbol{\theta}_t)\big)
- Update F_t(x) = F_{t-1}(x) + \alpha_t f(x; \boldsymbol{\theta}_t)
Hence, the only change when modeling different outcomes is to choose the loss function L and derive the pseudo-residuals (a small sketch follows this list):
- For regression, the loss is \frac{1}{2} (y - f(x))^2, and the pseudo-residual is y_i - f(x_i)
- For quantile regression to model the median, the loss is |y - f(x)|, and the pseudo-residual is \text{sign}(y_i - f(x_i))
- For classification with y \in \{0, 1\}, we can use the negative log-likelihood of a single observation, - [ y\log(p) + (1-y)\log(1-p) ], and link p to the continuous predictor F(x) through the log-odds, i.e., F(x) = \log(p/(1-p)). Then the pseudo-residual is y_i - p(x_i)
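As an illustration of this general recipe, below is a minimal sketch of gradient boosting for median regression (absolute loss), using rpart stumps as base learners. For simplicity it takes a fixed shrinkage step nu instead of performing the line search for \alpha_t; the data and all settings are our own illustrative choices.
# gradient boosting with absolute loss, using rpart stumps
library(rpart)
set.seed(1)
x = seq(0, 1, 0.01)
y = 2*sin(3*pi*x) + rnorm(length(x))
dat = data.frame(x = x)
ntrees = 100
nu = 0.1                          # fixed shrinkage instead of a line search
Fx = rep(median(y), length(y))    # initialize at the median, the minimizer of sum |y - c|
for (t in 1:ntrees)
{
  g = sign(y - Fx)                # pseudo-residuals: negative gradient of |y - F|
  fit = rpart(g ~ x, data = data.frame(dat, g = g),
              control = rpart.control(maxdepth = 1, cp = 0))
  Fx = Fx + nu * predict(fit, dat)
}
plot(x, y, col = "gray", pch = 19, cex = 0.5)
lines(x, 2*sin(3*pi*x), lwd = 2, col = "deepskyblue")   # true function
lines(x, Fx, lwd = 2, col = "darkorange")               # boosted median fit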
19.5 Gradient Boosting with Logistic Link
To see how the pseudo-residual of a classification model is derived, let's use the logistic link function, with the predicted probability p defined as:
p = \frac{e^{F(x)}}{1 + e^{F(x)}} = \frac{1}{1 + e^{-F(x)}}
The negative log-likelihood for the Bernoulli distribution for a single instance is given by:
L(y, p) = - [y \log(p) + (1 - y) \log(1 - p)]
Let’s first find the partial derivative of this loss function with respect to p:
\frac{\partial L(y, p)}{\partial p} = -\left[ y \frac{1}{p} - (1-y) \frac{1}{1-p} \right]
The derivative of p with respect to F(x) is:
\frac{\partial p}{\partial F(x)} = \frac{e^{-F(x)}}{(1 + e^{-F(x)})^2} = p(1-p)
Hence, the partial derivative of the loss function with respect to F is:
\begin{align*} \frac{\partial L(y, p)}{\partial F(x)} &= \frac{\partial L(y, p)}{\partial p} \cdot \frac{\partial p}{\partial F(x)}\\ &= -\left[ y \frac{1}{p} - (1-y) \frac{1}{1-p} \right] \cdot p(1-p)\\ &= -\big( y(1-p) - (1-y)p \big)\\ &= -(y - y p - p + y p)\\ &= -(y - p) \end{align*}
Since the pseudo-residual is defined as the negative gradient, we negate this quantity, and y_i - p_i is the pseudo-residual that we use in the boosting algorithm to fit the next tree / linear booster. The sign mainly determines the direction of the adjustment and is accounted for in the optimization process.
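As a small illustration (the data and settings below are made up for demonstration), gbm implements this logistic-link boosting through distribution = "bernoulli", and predicting with type = "response" applies the logistic link to the fitted F(x).
# gradient boosting with the Bernoulli (logistic) loss
library(gbm)
set.seed(1)
n = 500
x = runif(n)
p = 1 / (1 + exp(-3*sin(2*pi*x)))   # true P(Y = 1 | X = x)
y = rbinom(n, 1, p)
fit = gbm(y ~ x, data = data.frame(x, y), distribution = "bernoulli",
          n.trees = 200, shrinkage = 0.1, bag.fraction = 0.8)
# fitted probabilities: logistic transform of the boosted F(x)
phat = predict(fit, data.frame(x = x), n.trees = 200, type = "response")
plot(x, p, pch = 19, cex = 0.3, col = "deepskyblue", ylab = "P(Y=1 | X=x)")
points(x, phat, pch = 19, cex = 0.3, col = "darkorange")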