Chapter 23 Gradient Boosting Machines
The gradient boosting machine (GBM, J. H. Friedman 2001) is a powerful machine learning algorithm that can be used for any (generalized) regression task. It builds an ensemble of weak learners, typically decision trees, in a stage-wise manner to create a strong predictive model. The logic is similar to that of AdaBoost, but the weak learners are constructed differently: the key idea behind GBM is to iteratively add models that correct the errors of the previous models by optimizing a specified loss function.
23.1 Motivation: Lasso as Boosting
We use a linear model to motivate gradient boosting via functional gradient descent. Consider the Lasso estimator (Zhao and Yu 2007):
\[ \widehat{\boldsymbol\beta}^{\text{lasso}} = \arg\min_{\boldsymbol\beta \in \mathbb{R}^p} \sum_{i=1}^n \big(y_i - \mathbf{x}_i^\top \boldsymbol\beta\big)^2 + \lambda \|\boldsymbol\beta\|_1. \]
Let \(F(x) = \mathbf{x}^\top \boldsymbol\beta\) denote the fitted function. At iteration \(t\), the current fit is
\[ F_{t-1}(x_i) = \mathbf{x}_i^\top \boldsymbol\beta^{(t-1)}, \]
and the residuals are
\[ r_i^{(t)} = y_i - F_{t-1}(x_i). \]
Recall the forward stage-wise regression algorithm, which is closely related to the path-wise coordinate descent algorithm for the Lasso: with an infinitesimal step size, each update involves only the coordinate most aligned with the current residual,
\[ j_t = \arg\max_{j \in [p]} \left| \langle X_{\cdot j}, \mathbf r^{(t)} \rangle \right|, \]
and the update adds a small step in that direction, as long as the penalty \(\lambda\) is chosen so that the corresponding \(\beta_{j_t}\) is allowed to grow:
\[ \beta_{j_t}^{(t)} = \beta_{j_t}^{(t-1)} + \varepsilon \cdot \operatorname{sign}\!\big(\langle X_{\cdot j_t}, \mathbf r^{(t)} \rangle\big). \]
Alternatively, we can view this from a functional gradient descent perspective, with single-coordinate functions as the basis functions and the squared-error loss (without the penalty term). In this case, the residual \(r_i^{(t)}\) can be interpreted, up to a factor of 2, as the negative gradient of the loss function with respect to the fitted value \(F(x_i)\):
\[ \begin{aligned} -\frac{\partial (y_i - F(x_i))^2}{\partial F(x_i)}\Big|_{F = F_{t-1}} &= -\frac{\partial (y_i - \theta)^2}{\partial \theta}\Big|_{\theta = F_{t-1}(x_i)} \\ &= 2 \, \big(y_i - F_{t-1}(x_i)\big) \\ &= 2 \, r_i^{(t)}. \end{aligned} \]
This means that at each iteration, we are fitting a weak learner (in this case, a single coordinate function) to the negative gradient of the loss function and updating our model accordingly. This perspective allows us to generalize the idea to other types of loss functions and weak learners, leading us to the concept of gradient boosting.
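As a small illustration, the following sketch implements the forward stage-wise update above on simulated data. The step size eps, the number of steps, and the simulated design are illustrative choices, and the columns of X are standardized so the inner products are comparable across coordinates.
# forward stage-wise (epsilon) updates for a linear model
set.seed(1)
n <- 100
p <- 10
X <- scale(matrix(rnorm(n * p), n, p))      # standardized design
beta_true <- c(2, -1.5, rep(0, p - 2))
y <- X %*% beta_true + rnorm(n)
eps <- 0.01        # small step size
nsteps <- 500      # number of stage-wise updates
beta <- rep(0, p)
for (t in 1:nsteps) {
  r <- y - X %*% beta                        # current residuals
  cors <- drop(crossprod(X, r))              # <X_j, r> for each coordinate
  j <- which.max(abs(cors))                  # coordinate most aligned with the residual
  beta[j] <- beta[j] + eps * sign(cors[j])   # small step in that direction
}
round(beta, 2)   # estimates should concentrate on the first two (truly nonzero) coordinates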
23.2 Gradient Boosting
The stage-wise view from forward regression suggests a broader principle: model fitting can be framed as functional gradient descent. Rather than updating a single pre-specified coefficient, we iteratively update the fitted function in the direction of the negative gradient of the loss.
Formally, we consider additive models of the form \[ F_T(x) = \sum_{t=1}^T \alpha_t f(x;\boldsymbol\theta_t), \]
where each base learner \(f(x;\boldsymbol\theta)\) may be a linear function, spline, or tree. The goal is to minimize
\[ \min_{\{\alpha_t,\boldsymbol\theta_t\}_{t=1}^T} \; \sum_{i=1}^n L\!\big(y_i, F_T(x_i)\big). \]
Direct optimization over all \(\{\alpha_t, \boldsymbol\theta_t\}_{t=1}^T\) jointly is difficult, but we can proceed in a stage-wise manner (a minimal from-scratch sketch follows the list):
- Initialize: \(F_0(x)\), often a constant function.
- For \(t = 1, \ldots, T\):
  - Compute pseudo-residuals (negative functional gradients): \[ r_{it} = - \left.\frac{\partial L(y_i, F(x_i))}{\partial F(x_i)}\right|_{F = F_{t-1}}. \]
  - Fit a weak learner \(f_t(x)\) to the pairs \(\{(x_i, r_{it})\}_{i=1}^n\).
  - Choose a step size \(\alpha_t\), typically by a one-dimensional line search: \[ \alpha_t \in \arg\min_{\alpha}\sum_{i=1}^n L\!\big(y_i, F_{t-1}(x_i) + \alpha f_t(x_i)\big). \]
  - Update the fitted model: \[ F_t(x) = F_{t-1}(x) + \alpha_t f_t(x). \]
- Output: the final model \(F_T(x)\).
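To make the recipe concrete, here is a minimal from-scratch sketch for squared-error loss, using depth-one rpart trees as the weak learners and a fixed shrinkage factor alpha in place of the line search; the simulated data and all tuning values are illustrative choices.
library(rpart)
set.seed(1)
x <- seq(0, 1, 0.001)
dat <- data.frame(x = x, y = 2*sin(3*pi*x) + rnorm(length(x)))
Tsteps <- 100
alpha <- 0.1                          # fixed shrinkage instead of a line search
Fhat <- rep(mean(dat$y), nrow(dat))   # F_0: constant initialization
for (t in 1:Tsteps) {
  dat$r <- dat$y - Fhat               # pseudo-residuals for squared-error loss
  stump <- rpart(r ~ x, data = dat,
                 control = rpart.control(maxdepth = 1, cp = 0))
  Fhat <- Fhat + alpha * predict(stump, dat)   # stage-wise update F_t = F_{t-1} + alpha * f_t
}
plot(dat$x, dat$y, pch = 19, col = "gray", cex = 0.5)
lines(dat$x, Fhat, lwd = 2, col = "darkorange")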
The following example uses the gbm package, with regression trees as the weak learners \(f_t(x)\). We first simulate data from a simple nonlinear model and plot the true regression function:
library(gbm)
set.seed(1)   # for reproducibility of the simulated data
# a simple regression problem
x <- seq(0, 1, 0.001)
fx <- function(x) 2*sin(3*pi*x)
y <- fx(x) + rnorm(length(x))
plot(x, y, pch = 19, ylab = "y", col = "gray", cex = 0.5)
# plot the true regression line
lines(x, fx(x), lwd = 2, col = "deepskyblue")
The panels below show the fitted regression function at several numbers of iterations; the fitted model progressively approximates the true function as more trees are added.
# fit regression boosting
# (Using a large shrinkage here to visualize stagewise fits; in practice use 0.1 or smaller.)
gbm.fit <- gbm(y ~ x, data = data.frame(x, y), distribution = "gaussian",
n.trees = 300, shrinkage = 0.5, bag.fraction = 0.8)
# plot the fitted regression function at several iterations
par(mfrow=c(2,3))
size <- c(1, 5, 10, 50, 100, 300)
for (i in 1:6) {
par(mar=c(2,2,3,1))
plot(x, y, pch = 19, ylab = "y", col = "gray", cex = 0.5)
lines(x, fx(x), lwd = 2, col = "deepskyblue")
# additive score F_t(x) after size[i] iterations
Fx <- predict(gbm.fit, newdata = data.frame(x = x), n.trees = size[i])
lines(x, Fx, lwd = 3, col = "darkorange")
title(paste("# of Iterations = ", size[i]))
}
23.3 Gradient Boosting with General Loss
The residual-as-gradient perspective extends beyond squared error. In general, let the empirical risk be
\[ R(F) = \sum_{i=1}^n L\!\big(y_i, F(x_i)\big). \]
At iteration \(t\), we compute the pseudo-residuals
\[ g_{it} = - \left.\frac{\partial L(y_i, F(x_i))}{\partial F(x_i)}\right|_{F(x_i) = F_{t-1}(x_i)}. \]
The stage-wise procedure is:
- Fit a weak learner \(f_t(x; \boldsymbol\theta_t)\) to \(\{(x_i, g_{it})\}_{i=1}^n\).
- Line search for step length: \[ \alpha_t = \arg\min_\alpha \sum_{i=1}^n L\!\big(y_i, F_{t-1}(x_i) + \alpha f_t(x_i)\big). \]
- Update: \[ F_t(x) = F_{t-1}(x) + \alpha_t f_t(x). \]
This framework is gradient boosting: each iteration moves the fitted function in the direction of the negative gradient of the loss functional. Here are some examples of loss functions \(L\) and their corresponding pseudo-residuals (a small helper that computes them is sketched after the list):
- Squared error (regression): \[ L(y,F) = \tfrac{1}{2}(y - F)^2 \quad\Rightarrow\quad g_{it} = y_i - F_{t-1}(x_i). \]
- Absolute error (quantile regression, median case): \[ L(y,F) = |y - F| \quad\Rightarrow\quad g_{it} = \operatorname{sign}\!\big(y_i - F_{t-1}(x_i)\big). \]
- Logistic loss (binary classification): with \(F(x) = \log\!\big(p(x)/(1-p(x))\big)\), \[ L(y,F) = - \big[ y\log(p(x)) + (1-y)\log(1-p(x)) \big], \qquad g_{it} = y_i - p_{t-1}(x_i), \quad p_{t-1}(x) = \frac{1}{1 + e^{-F_{t-1}(x)}}. \]
- Exponential loss (AdaBoost, with \(y \in \{-1, +1\}\)): \[ L(y,F) = e^{-yF} \quad\Rightarrow\quad g_{it} = y_i e^{-y_i F_{t-1}(x_i)}. \]
- Quantile loss (quantile regression): for quantile level \(\tau \in (0,1)\), \[ L(y,F) = \begin{cases} \tau (y - F), & y > F, \\ (1-\tau)(F - y), & y \leq F, \end{cases} \quad\Rightarrow\quad g_{it} = \tau - \mathbf{1}\big(y_i \leq F_{t-1}(x_i)\big). \]
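The helper below is an illustrative sketch (not part of any package) that evaluates these pseudo-residuals given the observed responses and the current fit; the function name and argument conventions are assumptions made for this example.
# illustrative helper: pseudo-residuals for several losses, given y and the current fit F
pseudo_resid <- function(y, F, loss = c("squared", "absolute", "logistic",
                                        "exponential", "quantile"), tau = 0.5) {
  loss <- match.arg(loss)
  switch(loss,
         squared     = y - F,                      # L = (y - F)^2 / 2
         absolute    = sign(y - F),                # L = |y - F|
         logistic    = y - 1 / (1 + exp(-F)),      # y in {0, 1}, F on the logit scale
         exponential = y * exp(-y * F),            # y in {-1, +1}
         quantile    = tau - as.numeric(y <= F))   # check loss at level tau
}
# example: pseudo-residuals for median (tau = 0.5) regression via the absolute loss
pseudo_resid(y = c(1.2, -0.3, 2.5), F = c(0.5, 0.5, 0.5), loss = "absolute")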
23.4 Logistic Link
To see how the pseudo-residual of a classification model is derived, let’s use the logistic link with the predicted probability \(p\) defined as: \[ p = \frac{e^{F(x)}}{1 + e^{F(x)}} = \frac{1}{1 + e^{-F(x)}}. \]
The negative log-likelihood for the Bernoulli distribution for a single instance is: \[ L(y, p) = - \big[y \log(p) + (1 - y) \log(1 - p)\big]. \]
First, the partial derivative of this loss with respect to \(p\) is: \[ \frac{\partial L(y, p)}{\partial p} = -\frac{y}{p} + \frac{1-y}{1-p}. \]
The derivative of \(p\) with respect to \(F(x)\) is: \[ \frac{\partial p}{\partial F(x)} = p(1-p). \]
Hence, the partial derivative of the loss with respect to \(F\) is: \[ \frac{\partial L(y, p)}{\partial F(x)} = \left(-\frac{y}{p} + \frac{1-y}{1-p}\right)\, p(1-p) = -y(1-p) + (1-y)p = p - y. \]
Gradient boosting uses the negative gradient as the working response, so the pseudo-residual is \[ g_{it} = -\left.\frac{\partial L}{\partial F(x_i)}\right|_{F=F_{t-1}} = y_i - p_{t-1}(x_i), \] which is then fit by the next weak learner (a tree or a linear booster); the negative sign ensures each update moves the fitted function in the descent direction.
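As a sanity check, the short finite-difference sketch below (with arbitrary example values for \(y\) and \(F\)) compares a numerical derivative of the Bernoulli deviance with the analytic gradient \(p - y\); the two should agree up to numerical error.
# finite-difference check that dL/dF = p - y for the Bernoulli deviance
bern_loss <- function(y, F) {
  p <- 1 / (1 + exp(-F))
  -(y * log(p) + (1 - y) * log(1 - p))
}
y <- 1; F <- 0.3; h <- 1e-6
numeric_grad <- (bern_loss(y, F + h) - bern_loss(y, F - h)) / (2 * h)
analytic_grad <- 1 / (1 + exp(-F)) - y    # p - y
c(numeric = numeric_grad, analytic = analytic_grad)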
23.5 xgboost
xgboost (Chen and Guestrin 2016) is a popular and efficient implementation of gradient boosting that supports various loss functions and regularization techniques. It is widely used in machine learning competitions and real-world applications due to its speed and performance.
set.seed(2025)
n <- 2000
p <- 15
X <- matrix(rnorm(n*p), n, p)
f_true <- function(x) 2*sin(x[,1]) + 0.5*x[,2]^2 - 1.5*(x[,3]>0)
y <- f_true(X) + rnorm(n, sd = 1)
idx <- sample.int(n, floor(0.7*n))
dtrain <- xgboost::xgb.DMatrix(X[idx,], label = y[idx])
dvalid <- xgboost::xgb.DMatrix(X[-idx,], label = y[-idx])
params <- list(
objective = "reg:squarederror",
eval_metric = "rmse",
eta = 0.05, # shrinkage
max_depth = 4, # tree depth
subsample = 0.8, # row subsampling
colsample_bytree = 0.8, # column subsampling
min_child_weight = 5, # node min hessian-weighted size
lambda = 1.0, # L2 on leaf weights
alpha = 0 # L1 on leaf weights (optional)
)
watch <- list(train = dtrain, valid = dvalid)
fit <- xgboost::xgb.train(
params = params,
data = dtrain,
nrounds = 3000,
watchlist = watch,
early_stopping_rounds = 50,
verbose = 0
)
cat("Best iteration:", fit$best_iteration,
" Valid RMSE:", fit$best_score, "\n")
## Best iteration: 125 Valid RMSE: 1.089662
# Feature importance (gain-based)
imp <- xgboost::xgb.importance(model = fit, feature_names = paste0("x",1:p))
head(imp, 8)
## Feature Gain Cover Frequency
## <char> <num> <num> <num>
## 1: x1 0.52559558 0.20038568 0.13497823
## 2: x2 0.18380336 0.22652956 0.19883890
## 3: x3 0.16842717 0.11526674 0.08659894
## 4: x11 0.01344053 0.04193693 0.05418481
## 5: x10 0.01260888 0.04150826 0.05466860
## 6: x6 0.01253113 0.04700038 0.05805515
## 7: x12 0.01149845 0.03014653 0.05224964
## 8: x9 0.01088731 0.05297876 0.05805515
# Predictions and validation RMSE
pred <- predict(fit, dvalid)
rmse <- sqrt(mean((pred - y[-idx])^2))
rmse
## [1] 1.089662