In recent years, xgboost has become extremely popular, mainly because of its computational performance. We will first
demonstrate the idea of boosting by drawing a connection with the Lasso through a
linear weak learner; in this case, the model behaves like a stage-wise regression. After that, we demonstrate
the tree boosting idea. Let’s consider fitting a regression model by
aggregating a collection of “small” models
\[F_T(x) = \sum_{t = 1}^T \alpha_t f_t(x)\] Here we fit \(T\) models, where each small model \(f_t(x)\) receives a weight \(\alpha_t\). Each \(f_t(x)\) is constructed progressively, meaning that we first fit \(f_1(x)\), then figure out the next most beneficial function \(f_2(x)\) to add on top of \(f_1(x)\), and iterate this process. This is different from random forests, in which each \(f_t(x)\) is built independently and in parallel, with weight \(\alpha_t = 1/T\). For a linear boosting model, we are interested in fitting a regression in the form of
\[Y \sim \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p.\]
However, we will not solve for all the parameters in just one step. Instead, we may view each step \(f_t(x)\) as growing one \(\beta_j\) parameter by a small magnitude \(\alpha_t\), say \(\alpha_t = 0.0001\). Different steps \(t\) may involve different variables \(j\).
Assume the observed data are \(\{y_i, x_i\}_{i=1}^n\), where \(x_i\) is a \(p\)-dimensional covariate vector. At step \(t = 1\), we initialize all \(\beta_j\)'s at 0. Hence the current residual is
\[\epsilon_i = y_i - 0 = y_i.\] Then, let's find the variable with the highest absolute correlation with the current residuals:
\[j_\ast = \arg\max_j \,\, | \text{corr}(\boldsymbol\epsilon, \, \mathbf{x}^{(j)}) | \] where \(\mathbf{x}^{(j)} = (x_1^{(j)}, x_2^{(j)}, \ldots, x_n^{(j)})^T\) is the \(n\)-vector of the \(j\)th variable, and \(\boldsymbol\epsilon = (\epsilon_1, \epsilon_2, \ldots, \epsilon_n)^T\) is the \(n\)-vector of residuals. We can then use
\[f_1(x) = \widehat\beta^\text{ols}_{j_\ast} x^{(j_\ast)} \] where \(\widehat\beta^\text{ols}_{j_\ast}\) is the OLS estimator from regressing the current residual \(\boldsymbol\epsilon\) on \(\mathbf{x}^{(j_\ast)}\). But this is too aggressive. Hence, we apply a step size \(\alpha_t\), which effectively keeps the growth of the \(j_\ast\)th parameter small. This completes the first step, \(t = 1\). Now, since we have explained a bit of the variation in the outcome, let's remove that from the outcome and define the residual of each observation \(i\) at \(t = 2\) as
\[\epsilon_i = y_i - \alpha_1 f_1(x_i).\] We then find the next best \(j_\ast\) and boost \(\beta_{j_\ast}\) by a small magnitude. In general, at the \((t+1)\)th step, we calculate the residuals as
\[\epsilon_i = y_i - \sum_{k = 1}^t \alpha_k f_k(x_i) \] You can imagine that by doing this, we alternate the growth of the \(\beta_j\)'s in many small steps. The following code demonstrates this in a one-dimensional problem. Since there is no other \(j\) to alternate with, we simply grow the slope in a steady way.
library(xgboost)
set.seed(1)
p = 1
x = seq(-1, 1, 0.01)
fx <- function(x) 1*x + 2*x^3
y = fx(x) + rnorm(length(x))
# construct data for xgboost
traindata = xgb.DMatrix(data = data.matrix(x), label = y)
# we use linear booster with a step size 0.2
param = list(booster = "gblinear", eta = 0.2)
par(mfrow = c(2, 3))
par(mar = c(2, 2, 4, 2))
# try different number of T
for (nrounds in c(1, 3, 5, 7, 9, 11))
{
xgb.fit = xgb.train(param, traindata, nrounds = nrounds)
pred_y = predict(xgb.fit, traindata)
plot(x, y, pch = 19, col="darkorange", main = paste("nrounds =", nrounds))
points(x, pred_y, type = "l")
}
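To make the stage-wise idea concrete beyond the one-dimensional example, here is a minimal hand-coded sketch with two covariates, following exactly the steps described above. The data-generating model, step size, and number of steps are made up for illustration.
# a hand-coded sketch of forward stage-wise (linear) boosting, for illustration
set.seed(1)
n = 200
X = cbind(rnorm(n), rnorm(n))            # two covariates
y = 1.5*X[, 1] - 1*X[, 2] + rnorm(n)     # outcome depends on both variables
beta = c(0, 0)                           # initialize all coefficients at 0
alpha = 0.01                             # a small, fixed step size
for (t in 1:2000)
{
  eps = y - X %*% beta                   # current residuals
  jstar = which.max(abs(cor(eps, X)))    # variable most correlated with the residuals
  b_ols = sum(eps * X[, jstar]) / sum(X[, jstar]^2)  # OLS of residuals on that variable
  beta[jstar] = beta[jstar] + alpha * b_ols          # grow only that coefficient, slightly
}
beta                                     # after many small steps, close to OLS
lm(y ~ X - 1)$coef                       # compare with the OLS solution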
More rigorously, the boosting algorithm minimizes a loss function by iteratively fitting models to the "residual". If we use the squared error loss in the regression case, then for a single instance the loss is:
\[ L(y, F(x)) = (y - F(x))^2 \]
Let’s look at the partial derivative of this loss function with respect to \(F(x)\):
\[ \frac{\partial L(y, F(x))}{\partial F(x)} = -2(y - F(x)) \]
In the context of gradient boosting, the pseudo-residual \(r\) for each instance \(i\) is the negative gradient of the loss function evaluated at the current prediction \(F(x)\):
\[ r_i = -\frac{\partial L(y_i, F(x_i))}{\partial F(x_i)} = 2(y_i - F(x_i)) \]
Often, the constant factor 2 is dropped (it can be absorbed into the step size), which simplifies this to:
\[ r_i = y_i - F(x_i) \]
Here, \(F(x_i)\) is the current prediction for instance \(i\) before the new weak learner is added to the ensemble.
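As a quick numerical check with made-up toy numbers, the negative gradient of the squared error loss at the current prediction is indeed \(2(y - F(x))\), i.e., proportional to the ordinary residual:
# numerically check that the negative gradient of the squared error loss
# equals 2 * (y - F), i.e., proportional to the residual (toy numbers)
y0 = 3.0                                 # a single observed outcome
F0 = 1.2                                 # the current prediction F(x)
loss <- function(Fhat) (y0 - Fhat)^2
h = 1e-6
-(loss(F0 + h) - loss(F0 - h)) / (2*h)   # central finite difference: about 3.6
2 * (y0 - F0)                            # exactly 3.6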
Tree boosting is not really much different, except that at each step,
we can fit a regression tree model based on the current residual. We may
multiply the tree model by a small magnitude \(\eta\) to prevent growing the model too
fast. We will use the gbm
package as an example. You can
further control the complexity of trees through tuning parameters.
library(gbm)
## Loaded gbm 2.2.2
## This version of gbm is no longer under development. Consider transitioning to gbm3, https://github.com/gbm-developers/gbm3
# a simple regression problem
p = 1
x = seq(0, 1, 0.001)
fx <- function(x) 2*sin(3*pi*x)
y = fx(x) + rnorm(length(x))
plot(x, y, pch = 19, ylab = "y", col = "gray")
lines(x, fx(x), lwd = 2) # plot the true regression line
# fit regression boosting
# I use a very large shrinkage value for demonstrating the functions
# in practice you should use 0.1 or even smaller values for stability
gbm.fit = gbm(y~x, data = data.frame(x, y), distribution = "gaussian",
n.trees=300, shrinkage=0.5, bag.fraction=0.8)
# somehow, cross-validation for 1 dimensional problem creates error
# gbm(y ~ ., data = data.frame(x, y), cv.folds = 3) # this produces an error
# plot the fitted regression function at several iterations
par(mfrow=c(2,3))
size=c(1,5,10,50,100,300)
for(i in 1:6)
{
par(mar=c(2,2,3,1))
plot(x, y, pch = 19, ylab = "y", col = "gray")
lines(x, fx(x), lwd = 2)
Fx = predict(gbm.fit, n.trees=size[i]) # the fitted regression function at the training x values
lines(x, Fx, lwd = 3, col = "red")
title(paste("# of Iterations = ", size[i]))
}
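To make the mechanism explicit, below is a minimal hand-rolled version of the same idea using rpart trees: at each step we fit a small tree to the current residuals and add a shrunken copy of it to the fit. This is an illustrative sketch only (the step size and tree depth are chosen arbitrarily), not how gbm is implemented internally.
library(rpart)
# hand-rolled gradient boosting for regression, fitting trees to residuals
eta = 0.1                               # shrinkage factor
Fx = rep(mean(y), length(y))            # start from the overall mean
for (t in 1:300)
{
  r = y - Fx                            # current residuals
  fit = rpart(r ~ x, data = data.frame(x, r),
              control = rpart.control(maxdepth = 2, cp = 0))  # a small tree
  Fx = Fx + eta * predict(fit, data.frame(x))                 # add a shrunken copy
}
plot(x, y, pch = 19, col = "gray")
lines(x, fx(x), lwd = 2)                # true regression function
lines(x, Fx, lwd = 3, col = "red")      # hand-rolled boosting fit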
For classification problems, we do not directly have residuals. However, we can think of the residual as coming from each observation's contribution to the likelihood function, so the goal becomes minimizing the negative log-likelihood (analogous to the \(\ell_2\) loss). Using the logistic link function, the predicted probability \(p\) is defined as:
\[ p = \frac{1}{1 + e^{-F(x)}} \]
The negative log-likelihood for the Bernoulli distribution for a single instance is given by:
\[ L(y, p) = - [y \log(p) + (1 - y) \log(1 - p)] \]
Let’s first find the partial derivative of this loss function with respect to \(p\):
\[ \frac{\partial L(y, p)}{\partial p} = -\left[ y \frac{1}{p} - (1-y) \frac{1}{1-p} \right] \]
The derivative of \(p\) with respect to \(F(x)\) is:
\[ \frac{\partial p}{\partial F(x)} = \frac{e^{-F(x)}}{(1 + e^{-F(x)})^2} = p(1-p) \]
Hence, the partial derivative of the loss function with respect to \(F\) is:
\[\begin{align*} \frac{\partial L(y, p)}{\partial F(x)} &= \frac{\partial L(y, p)}{\partial p} \cdot \frac{\partial p}{\partial F(x)}\\ &= -\left[ y \frac{1}{p} - (1-y) \frac{1}{1-p} \right] \cdot p(1-p)\\ &= -\big( y(1-p) - (1-y)p \big)\\ &= -(y - yp - p + yp)\\ &= -(y - p) \end{align*}\]
The quantity \(y_i - p_i\) is the pseudo-residual that we use in the boosting algorithm to fit the next tree or linear booster. The sign flips because we move along the negative gradient in a minimization problem.
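Here is a minimal sketch of this logistic boosting recipe on simulated data: start from \(F(x) = 0\), compute \(p\), fit a small tree to the pseudo-residuals \(y - p\), take a shrunken step, and repeat. The data-generating model, step size, and tree depth below are made up for illustration.
library(rpart)
# a hand-rolled sketch of gradient boosting with the Bernoulli deviance
set.seed(1)
n = 500
x = seq(0, 1, length.out = n)
p_true = 1 / (1 + exp(-(4*x - 2)))       # true P(Y = 1 | X = x)
y = rbinom(n, 1, p_true)
Fx = rep(0, n)                           # start from F(x) = 0
eta = 0.1                                # shrinkage factor
for (t in 1:200)
{
  p = 1 / (1 + exp(-Fx))                 # current predicted probability
  r = y - p                              # pseudo-residual = negative gradient
  fit = rpart(r ~ x, data = data.frame(x, r),
              control = rpart.control(maxdepth = 2, cp = 0))
  Fx = Fx + eta * predict(fit, data.frame(x))
}
plot(x, p_true, type = "l", lwd = 2, ylab = "P(Y=1 | X=x)")
lines(x, 1 / (1 + exp(-Fx)), lwd = 2, col = "red")   # boosted estimate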
AdaBoost is a boosting model for classification. The idea is slightly different from regression since we cannot directly calculate residuals. Instead, AdaBoost defines a weight on each observation and progressively updates these weights. Observations with larger weights play a role similar to observations with large residuals, so that the \(f_t(x)\) in the next iteration will try to address them (fit them better). The following code demonstrates AdaBoost.
n = 1000; set.seed(1)
x = cbind(seq(0, 1, length.out = n), runif(n))
py = (sin(4*pi*x[, 1]) + 1)/2
y = rbinom(n, 1, py)
plot(x[, 1], y + ifelse(y == 1, runif(n, -0.15, 0), runif(n, 0, 0.15)),
pch = 19, col = ifelse(y==1, "red", "blue"), xlab = "x", ylab = "P(Y=1 | X=x)")
points(x[, 1], py, type = "l", lwd = 3)
# fit AdaBoost with bootstrapping, again I am using a large shrinkage factor
gbm.fit = gbm(y~., data.frame(x, y), distribution="adaboost", n.minobsinnode = 2,
n.trees=200, shrinkage = 1, bag.fraction=0.8, cv.folds = 10)
# get the best number of trees from cross-validation (or oob if no cv is used)
gbm.perf(gbm.fit)
## [1] 22
# plot the decision function (Fx, not sign(Fx))
par(mfrow=c(2,3))
size=c(1, 5, 10, 22, 50, 100)
for(i in 1:6)
{
par(mar=c(2,2,3,1))
plot(x[, 1], py, type = "l", lwd = 3, ylab = "P(Y=1 | X=x)", col = "gray")
points(x[, 1], y + ifelse(y == 1, runif(n, -0.15, 0), runif(n, 0, 0.15)),
pch = 19, col = ifelse(y==1, "red", "blue"))
Fx = predict(gbm.fit, n.trees=size[i]) # this returns the fitted function, but not class
lines(x[, 1], 1/(1+exp(-2*Fx)), lwd = 3)
title(paste("# of Iterations = ", size[i]))
}
# You can peek into a single tree
pretty.gbm.tree(gbm.fit, i.tree = 1)
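To connect with this weighting view, below is a minimal hand-coded sketch of discrete AdaBoost with rpart stumps as weak learners, reusing the x and y generated above. It is an illustrative sketch only; gbm's "adaboost" distribution instead fits the exponential loss within the gradient boosting framework.
library(rpart)
# a hand-coded discrete AdaBoost sketch with stumps; labels are recoded to {-1, +1}
yy = 2*y - 1
dat = data.frame(x1 = x[, 1], x2 = x[, 2], yy = factor(yy))
w = rep(1/n, n)                          # observation weights
Fx = rep(0, n)                           # ensemble score
for (t in 1:100)
{
  fit = rpart(yy ~ x1 + x2, data = dat, weights = w,
              control = rpart.control(maxdepth = 1, cp = 0))   # a weighted stump
  pred = ifelse(predict(fit, dat, type = "class") == "1", 1, -1)
  err = sum(w * (pred != yy)) / sum(w)                 # weighted training error
  if (err == 0 || err >= 0.5) break
  alpha = 0.5 * log((1 - err) / err)                   # weight of this stump
  w = w * exp(-alpha * yy * pred)                      # up-weight misclassified points
  w = w / sum(w)
  Fx = Fx + alpha * pred
}
mean(sign(Fx) == yy)                     # training accuracy of the weighted ensemble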
Shrinkage
# you may use a very small shrinkage factor to improve stability
# however, this means you will need to use more trees
gbm.fit = gbm(y~., data.frame(x, y), distribution="adaboost", n.minobsinnode = 2,
n.trees=1000, shrinkage = 0.1, bag.fraction=0.8, cv.folds = 10)
gbm.perf(gbm.fit)
## [1] 487
par(mfrow=c(2,3))
size=c(1,10,50,100,250,500) # 1000 trees is clearly over-fitting
for(i in 1:6)
{
par(mar=c(2,2,3,1))
plot(x[, 1], py, type = "l", lwd = 3, ylab = "P(Y=1 | X=x)", col = "gray")
points(x[, 1], y + ifelse(y == 1, runif(n, -0.15, 0), runif(n, 0, 0.15)),
pch = 19, col = ifelse(y==1, "red", "blue"))
Fx = predict(gbm.fit, n.trees=size[i]) # this returns the fitted function, but not class
lines(x[, 1], 1/(1+exp(-2*Fx)), lwd = 3)
title(paste("# of Iterations = ", size[i]))
}
# our previous example with a circle
set.seed(1)
n = 500
x1 = runif(n, -1, 1)
x2 = runif(n, -1, 1)
y = rbinom(n, size = 1, prob = ifelse(x1^2 + x2^2 < 0.6, 0.8, 0.2))
xgrid = expand.grid(x1 = seq(-1, 1, 0.01), x2 = seq(-1, 1, 0.01))
# if we only use one variable in each tree
gbm.fit = gbm(y~., data = data.frame(x1, x2, y), distribution="adaboost",
n.trees=1000, shrinkage=0.01, bag.fraction=0.8,
interaction.depth = 1, cv.folds=10)
usetree = gbm.perf(gbm.fit, method="cv")
Fx = predict(gbm.fit, xgrid, n.trees=usetree)
pred = matrix(1/(1+exp(-2*Fx)) > 0.5, 201, 201)
par(mar=rep(2,4))
contour(seq(-1, 1, 0.01), seq(-1, 1, 0.01), pred, levels=0.5, labels="",axes=FALSE)
points(x1, x2, col = ifelse(y == 1, "deepskyblue", "darkorange"),
pch = 19, yaxt="n", xaxt = "n")
points(xgrid, pch=".", cex=1.2, col=ifelse(pred, "deepskyblue", "darkorange"))
box()
pretty.gbm.tree(gbm.fit, i.tree = 1)
We could also allow interactions in each tree. This leads to deeper trees and requires fewer iterations to achieve similar performance.
gbm.fit = gbm(y~., data = data.frame(x1, x2, y), distribution="adaboost",
n.trees=1000, shrinkage=0.01, bag.fraction=0.8,
interaction.depth = 2, cv.folds=10)
usetree = gbm.perf(gbm.fit, method="cv")
Fx = predict(gbm.fit, xgrid, n.trees=usetree)
pred = matrix(1/(1+exp(-2*Fx)) > 0.5, 201, 201)
par(mar=rep(2,4))
contour(seq(-1, 1, 0.01), seq(-1, 1, 0.01), pred, levels=0.5, labels="",axes=FALSE)
points(x1, x2, col = ifelse(y == 1, "deepskyblue", "darkorange"),
pch = 19, yaxt="n", xaxt = "n")
points(xgrid, pch=".", cex=1.2, col=ifelse(pred, "deepskyblue", "darkorange"))
box()
pretty.gbm.tree(gbm.fit, i.tree = 1)
You can estimate variable importance in a similar fashion as in a random forest. Since we can calculate the reduction in accuracy after randomly permuting a variable's values in the out-of-bag data, we may then aggregate these measures over all trees.
gbm.fit = gbm(y~., data = data.frame(x1, x2, matrix(runif(n*10), n, 10), y),
distribution="adaboost", n.trees=1000, shrinkage=0.01,
bag.fraction=0.8, interaction.depth = 2, cv.folds=10)
summary(gbm.fit, method = permutation.test.gbm)