The asymptotic normality of random forests remains an active area of research. Its validity may depend on various assumptions concerning the construction of the splitting rule and the associated tuning parameters (see, e.g., Mentch and Hooker (2016) and Athey, Tibshirani, and Wager (2019)). If one accepts that random forests are asymptotically normal, then accurate estimation of their variance enables the construction of confidence intervals for predictions and other associated quantities. This note therefore focuses on variance estimation for random forests. We introduce three main tools for estimating the variance of random forest quantities: the bootstrap, the jackknife, and U-statistics.
Note that all of these approaches can be applied to various quantities of interest in a random forest, and even beyond random forests, but we illustrate each of them with a different task.
It is well known that the bootstrap can be used to approximate the sampling distribution of an estimator. Mathematically, if we denote the observed data as \(X_1, X_2, \ldots, X_n\), drawn independently from an underlying distribution \(\mathbf{P}\), then the empirical distribution of this sample is \(\mathbf{P}_n = n^{-1} \sum_{i = 1}^n \delta(X_i)\), where \(\delta(X_i)\) denotes a point mass placed at \(X_i\). The following is an example of the empirical estimation of the c.d.f. of a univariate variable.
# obtain some samples from normal
n = 100
x = rnorm(n)
# calculate the empirical cdf using apply
u = seq(-3.5, 3.5, length.out = 200)
mycdf <- function(u, X)
{
  cdfe = rep(0, length(u))
  for (i in 1:length(u))
    cdfe[i] = mean(X <= u[i])
  return(cdfe)
}
Fn = mycdf(u, x)
# plot the estimated cdf
plot(u, Fn, type = "l", xlab = "x", ylab = "F(x)")
# the x bar and confidence interval of x bar
mean(x)
## [1] -0.01861898
c(mean(x) - 1.96*sd(x)/sqrt(n), mean(x) + 1.96*sd(x)/sqrt(n))
## [1] -0.2206543 0.1834164
Since we can expect the empirical distribution \(\mathbf{P}_n\) to converge to the true distribution \(\mathbf{P}\) as \(n\) grows, we can also anticipate that random and independent draws with replacement from \(\mathbf{P}_n\) (bootstrap samples) will behave approximately like samples drawn from \(\mathbf{P}\). This has an important consequence: statistics computed on bootstrap samples can be used to approximate the sampling distribution, and hence the variance, of statistics computed on the original data. Here is an example that uses bootstrap samples to approximate the distribution of \(\bar{x}\) and construct a confidence interval:
# plot the estimated cdf, then overlay the cdfs of bootstrap samples
plot(u, Fn, type = "l", xlab = "x", ylab = "F(x)", lwd = 2, col = "red")
# bootstrapped means
nbs = 100
xbarb = rep(0, nbs)
for (i in 1:nbs)
{
  xb = sample(x, replace = TRUE)
  points(u, mycdf(u, xb), type = "s", col = "gray")
  xbarb[i] = mean(xb)
}
# bootstrap estimation of 90% confidence interval of xbar
quantile(xbarb, c(0.05, 0.95))
## 5% 95%
## -0.1634821 0.1565188
This idea of bootstrapping can be applied to random forests. By fitting a random forest to each bootstrap sample, we can approximate the distribution of the random forest estimator \(\hat{f}_\text{RF}(x)\) itself. A naive procedure may work as follows: draw many bootstrap samples from the original data, fit a random forest to each of them, record the quantity of interest (e.g., a prediction or a variable importance measure) from each fit, and use the variability of these values across the bootstrap replicates as a variance estimate.
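To make this concrete, here is a minimal sketch of the naive procedure using the randomForestSRC package, with the prediction at a single target point as the quantity of interest. The simulated data, the target point x0, and the number of replicates nbs are arbitrary choices for illustration.

# a minimal sketch of the naive bootstrap procedure for a random forest prediction
library(randomForestSRC)
set.seed(1)
n = 200
p = 5
x = matrix(rnorm(n*p), n, p)
y = x[, 1] + rnorm(n)
dat = data.frame(x, y)
# a target point at which we want the prediction variance
x0 = data.frame(matrix(0, 1, p))
colnames(x0) = colnames(dat)[1:p]
# refit the forest on each bootstrap sample and record the prediction at x0
nbs = 50
pred.b = rep(0, nbs)
for (b in 1:nbs)
{
  id = sample(1:n, n, replace = TRUE)
  fit.b = rfsrc(y ~ ., dat[id, ])
  pred.b[b] = predict(fit.b, x0)$predicted
}
# bootstrap estimate of the variance of the random forest prediction at x0
var(pred.b)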
However, this approach may encounter significant issues for some quantities of interest, such as variable importance. The main issue is that bootstrap samples contain replicates of the same observation. Hence, in a bootstrap sample, the same observation may appear in both the training examples (in-bag data) and the out-of-bag (oobag) data of a given tree. This introduces bias into the variable importance calculation because the in-bag and oobag data are not independent.
To address this issue, Ishwaran and Lu (2019) proposed two approaches. One is called the .164 bootstrap estimator. The main idea is straightforward: when using a bootstrap sample to fit a random forest, remove an observation from a tree's oobag data if it is already used in that tree's in-bag fitting. Based on some simple mathematical approximations, this leaves approximately \(0.164 n\) unique observations in the oobag set of each tree, and the resulting estimator is called a double-bootstrap estimator (Ishwaran and Lu 2019) for variable importance. This approach is applicable to any type of random forest. An example using the randomForestSRC package is presented below. In this code, we have 30 variables, while only the first and third variables are useful.
set.seed(1)
n = 300
p = 30
x = matrix(rnorm(n*p), n, p)
y = x %*% c(1, 0, 1, rep(0, p-3)) + rnorm(n)
library(randomForestSRC)
# training the forest
reg.o <- rfsrc(y ~ ., data.frame(x, y))
# the .164 double bootstrap approach
reg.dbs.o <- subsample(reg.o, B = 100, bootstrap = TRUE, verbose = FALSE)
plot.subsample(reg.dbs.o)
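As a quick side check, the 0.164 figure can be verified with a small simulation of the double-bootstrap mechanism: bootstrap the data, let a tree bootstrap that bootstrap sample, and count how many of the original observations end up truly out-of-bag for the tree. The sample size below is an arbitrary choice.

# simulate the double bootstrap: fraction of observations that are truly
# out-of-bag for a tree after removing in-bag replicates
set.seed(1)
n = 10000
bs1 = sample(1:n, n, replace = TRUE)   # outer bootstrap sample
bs2 = sample(bs1, n, replace = TRUE)   # a tree's bootstrap of the bootstrap sample
oob = setdiff(bs1, bs2)                # ids in the bootstrap sample but not used by the tree
length(oob) / n                        # approximately 0.164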
What's wrong with taking the bootstrapped trees from a single fitted random forest and using the distribution of their variable importance measures?
The jackknife is another approach for estimating the variance of an estimator. The main idea is to remove one observation at a time and measure how much the estimator changes. The jackknife is closely related to the influence function, which quantifies the effect of each observation on an estimator and can be used to characterize the variance of any estimator; the jackknife serves as an empirical approximation to it. The variance of the estimator can then be quantified as the sum of these squared changes. A general formula is given by the following:
\[ \hat{\text{V}}_\text{J}(\hat \theta_\ast) = \frac{n-1}{n} \sum_{i = 1}^{n} \left( \hat \theta_{(-i)} - \hat \theta_\ast \right)^2, \]
where \(\hat \theta_\ast\) is the average of all such remove-one estimators, i.e.,
\[ \hat\theta_\ast = \frac{1}{n} \sum_i \hat \theta_{(-i)}, \]
and \(\hat \theta_{(-i)}\) is the estimator calculated with observation \(i\) removed. For the sample mean, you can check that \(\hat\theta_\ast\) is simply \(\bar x\). The following code performs this calculation.
# obtain some samples from normal
n = 100
x = rnorm(n)
# calculating the influence function
influence = rep(0, length(x))
xbar = mean(x) # theta_*
for (i in 1:length(x))
{
  xbar_no_i = mean(x[-i])
  influence[i] = xbar_no_i - xbar
}
# jackknife variance estimation
var_est = sum(influence^2) * (length(x) - 1) / (length(x))
# compare with the theoretical estimation
var_est
## [1] 0.008219851
var(x) / length(x)
## [1] 0.008219851
Surprisingly (or maybe not?), the jackknife variance estimator is identical to the usual estimator of the variance of \(\hat \theta = \bar x\). This happens because \(\bar x\) is a linear estimator of the samples, so the variance of the estimator is exactly the sum of the contributions from the individual observations. We will use this logic for estimating confidence intervals of predictions later on. But for now, let's go back to the application of variable importance, with a slight extension.
Since the oobag data contain only a small fraction of the total sample, the bootstrap version of the variable importance evaluation may be inefficient and unstable, and you may need a large number of bootstrap samples to obtain a stable variance estimate. One alternative is to use subsamples of size \(n - d\), instead of bootstrap samples, to fit each replicate of the random forest, and to approximate the variance using their deviations from the full-sample random forest. In this case, the size \(n - d\) subsamples do not contain replicates, so we can still use a regular random forest to fit each subsample, and you will get approximately \((1 - 0.632)(n - d)\) oobag observations in each tree. The delete-d jackknife estimator (Shao and Wu 1989) is a variant of the traditional jackknife estimator. If we run over all possible size \(n - d\) subsamples, the formula is given by
\[ \hat{\text{V}}_{\text{J(d)}}(\hat \theta_n) = \frac{n-d}{d} {\binom{n}{n-d}}^{-1} \sum_{|S| = n - d} \left( \hat \theta_{S} - \bar \theta_{n, d} \right)^2, \]
where \(\hat \theta_{S}\) is the estimator fitted on the size \(n - d\) subsample \(S\) (equivalently, with the \(d\) observations outside \(S\) removed), and \(\bar \theta_{n, d}\) is the average of all such delete-\(d\) estimators, similar to the jackknife. In practice, we can simply treat \(\bar \theta_{n, d}\) as the random forest fitted to the entire sample and ignore the difference. The fraction \(\frac{n-d}{d}\) is a scaling factor that corrects for the reduced sample size, and we only need to run a finite number of random subsets, rather than all possible size \(n - d\) subsamples, to obtain an approximation. Below, a small sketch first illustrates this formula on the sample mean, followed by the delete-d jackknife for variable importance using the previously generated data.
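Here is a minimal sketch of the delete-d jackknife applied to the sample mean, where the average over all subsets is approximated by nsub randomly drawn subsets; the choices of d and nsub are arbitrary. For the sample mean, the result should be close to the usual estimate var(x)/n.

# delete-d jackknife for the sample mean, approximated with random subsets
set.seed(1)
n = 100
x = rnorm(n)
d = 20
nsub = 2000
theta_sub = rep(0, nsub)
for (i in 1:nsub)
  theta_sub[i] = mean(x[sample(1:n, n - d)])   # estimator fitted on a size n-d subsample
# delete-d jackknife variance estimate with the (n-d)/d scaling factor
(n - d) / d * mean((theta_sub - mean(theta_sub))^2)
# compare with the usual estimate of the variance of the sample mean
var(x) / n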
# Delete-d jackknife estimator
reg.smp.o <- subsample(reg.o, verbose = FALSE)
plot.subsample(reg.smp.o)
Use the Boston housing data to fit a random forest. Then estimate the variance of the variable importance measure, select a smaller set of significant variables and refit the model with the selected variables.
data("Boston", package = "MASS")
# training the forest
reg.boston <- rfsrc(medv ~ ., Boston, ntree = 1000, mtry = 13, nodesize = 5, nsplit = 5)
tail(reg.boston$err.rate, 1)
## [1] 12.39758
# Delete-d jackknife estimator
reg.smp.boston <- subsample(reg.boston, B = 200, bootstrap = FALSE, verbose = FALSE)
plot.subsample(reg.smp.boston)
# refit the model
reg.boston <- rfsrc(medv ~ lstat + rm + dis + crim + nox + rad + tax, data = Boston,
ntree = 1000, mtry = 6, nodesize = 6, nsplit = 5)
tail(reg.boston$err.rate, 1)
## [1] 12.50151
While the previous approaches can estimate the variance of quantities derived from a random forest, they still require fitting many random forests, which can be computationally expensive. However, if we look more closely inside a single random forest, it already creates many variations of subsamples or bootstrap samples through its trees. Although we cannot simply resample these trees to calculate the variance, we might be able to exploit this mechanism in other ways. In this section, we focus on variance estimation for a different task: predicting at a new target point \(x_0\). We start with the jackknife procedure proposed in Wager, Hastie, and Efron (2014).
The idea of the jackknife can be incorporated into quantifying the variance of a prediction, rather than variable importance, although this comes with a small twist. Let's first look at the jackknife again, viewing the random forest prediction at a new target point \(x_0\) as the parameter of interest:
\[ \hat{\text{V}}_\text{J}\big(\hat f_\text{RF}(x_0) \big) = \frac{n-1}{n} \sum_{i = 1}^{n} \left( \hat f_{(-i)}(x_0) - \hat f_\text{RF}(x_0) \right)^2. \]
Here, \(\hat f_\text{RF}(x_0)\) is the random forest prediction built using all observations; it is the average of \(B\) trees, each fitted on a different bootstrap sample of the original \(n\) observations:
\[ \hat f_\text{RF}(x_0) = \frac{1}{B} \sum_b \hat f_b(x_0). \]
The quantity \(\hat f_{(-i)}(x_0)\) is also a random forest prediction, but built without observation \(X_i\). The question then is: can we estimate this variance with just one fitted random forest? Consider the following strategy: for each observation \(i\), collect the trees whose bootstrap samples do not contain \(X_i\), and average their predictions to form \(\hat f_{(-i)}(x_0)\); then plug these leave-one-out averages into the jackknife formula above.
This is the so-called Jackknife-after-Bootstrap estimator proposed by Efron (1992). This procedure is implemented in the ranger package.
set.seed(1)
ntrain = 600
ntest = 100
n = ntrain + ntest
p = 30
x = matrix(rnorm(n*p), n, p)
b = c(0.5, 0, 0.5, rep(0, p-3))
y = x %*% b + rnorm(n)
traindata = data.frame(x, y)[1:ntrain, ]
testdata = data.frame(x, y)[-(1:ntrain), ]
# calculate confidence interval
library(ranger)
ranger.fit <- ranger(y~., data = traindata, num.trees = 5000,
keep.inbag=TRUE, mtry = 30, min.node.size = 10)
# Infinitesimal Jackknife (infjack) or jackknife-after-bootstrap (jack)
ranger.pred <- predict(ranger.fit, data = testdata,
type = "se", se.method = "jack")
# the true link function
testlink = data.matrix(testdata[, 1:p]) %*% b
# coverage
lowerbound = ranger.pred$predictions - 1.96*ranger.pred$se
upperbound = ranger.pred$predictions + 1.96*ranger.pred$se
cover = (testlink <= upperbound) & (testlink >= lowerbound)
# plot results
plot(testlink, ranger.pred$predictions,
xlim = c(-2, 2), ylim = c(-2, 2),
pch = 19, col = ifelse(cover, "green", "red"),
xlab = "True Mean", ylab = "Estimated Mean and 95% CI")
lines(x = c(-2, 2), y = c(-2, 2))
# the confidence interval
for (i in 1:nrow(testdata))
lines(x = c(testlink[i], testlink[i]),
y = c(lowerbound[i], upperbound[i]))
## average standard error
mean(ranger.pred$se)
## [1] 0.311566
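To get a sense of what se.method = "jack" is computing, here is a rough manual sketch of the jackknife-after-bootstrap estimate at the first test point, using the in-bag counts stored in the fitted forest. This simplified version omits any finite-\(B\) bias correction, so it will not exactly match the packaged value.

# a rough manual sketch of the jackknife-after-bootstrap standard error
# at the first test point (no finite-B bias correction)
inbag = simplify2array(ranger.fit$inbag.counts)   # n x B matrix of in-bag counts
treepred = as.vector(predict(ranger.fit, data = testdata[1, ],
                             predict.all = TRUE)$predictions)
fhat = mean(treepred)                             # the forest prediction at this point
fminus = rep(0, ntrain)
for (i in 1:ntrain)
  fminus[i] = mean(treepred[inbag[i, ] == 0])     # trees whose bootstrap sample excludes obs i
# jackknife-after-bootstrap variance and standard error
VJ = (ntrain - 1) / ntrain * sum((fminus - fhat)^2)
sqrt(VJ)
# compare with the packaged estimate
ranger.pred$se[1]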
A slight variation of this approach is the Infinitesimal Jackknife estimator (Efron 2014), which utilizes the covariance between the number of times an observation appears in each tree and that tree's prediction. The motivation is that, when deleting just one observation, the remaining trees still carry a great deal of Monte Carlo uncertainty. To estimate the influence of adding or removing one observation, we can instead look at the covariance between how many times observation \(X_i\) appears in a tree and the tree's prediction. The formula is given by
\[ \hat{\text{V}}_\text{IJ}\big(\hat f_\text{RF}(x_0)\big) = \sum_{i = 1}^{n} \widehat{ \text{Cov} }_i^2 = \sum_{i = 1}^{n} \left[ \frac{\sum_b (N_{bi} - 1) \left( \hat f_b(x_0) - \hat f_\text{RF}(x_0) \right) }{B} \right]^2, \]
where \(N_{bi}\) is the number of replicates of observation \(i\) in the \(b\)-th tree, and \(\hat f_b(x_0)\) is the \(b\)-th tree's prediction at \(x_0\). The random forest version of this estimator was also proposed in Wager, Hastie, and Efron (2014). It should be noted that both of the above estimators can be badly biased upwards, although the bias is reduced with a large number of trees \(B\) (a large \(B\) is always recommended in practice). Bias corrections were also introduced in that paper (see also Sexton and Laake (2009)). Still, in practice, jackknife-related estimators often overestimate the variance; one reason is that they rely on the estimator being approximately linear and smooth. The following code performs predictions and computes confidence intervals using the Infinitesimal Jackknife.
# Infinitesimal Jackknife (infjack) or jackknife-after-bootstrap (jack)
ranger.pred <- predict(ranger.fit, data = testdata,
type = "se", se.method = "infjack")
# the true link function
testlink = data.matrix(testdata[, 1:p]) %*% b
# coverage
lowerbound = ranger.pred$predictions - 1.96*ranger.pred$se
upperbound = ranger.pred$predictions + 1.96*ranger.pred$se
cover = (testlink <= upperbound) & (testlink >= lowerbound)
# plot results
plot(testlink, ranger.pred$predictions,
xlim = c(-2, 2), ylim = c(-2, 2),
pch = 19, col = ifelse(cover, "green", "red"),
xlab = "True Mean", ylab = "Estimated Mean and 95% CI")
lines(x = c(-2, 2), y = c(-2, 2))
# the confidence interval
for (i in 1:nrow(testdata))
lines(x = c(testlink[i], testlink[i]),
y = c(lowerbound[i], upperbound[i]))
## average standard error
mean(ranger.pred$se)
## [1] 0.2509218
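The IJ formula above can also be evaluated directly from the in-bag counts and the per-tree predictions. Here is a rough manual sketch at the first test point, again without the bias correction and calibration steps used by the package, so the value will not match exactly.

# a rough manual sketch of the infinitesimal jackknife standard error
# at the first test point, following the formula above
inbag = simplify2array(ranger.fit$inbag.counts)   # n x B matrix of in-bag counts
treepred = as.vector(predict(ranger.fit, data = testdata[1, ],
                             predict.all = TRUE)$predictions)
fhat = mean(treepred)
# estimated covariance between (N_bi - 1) and the tree predictions, for each i
covi = apply(inbag - 1, 1, function(Ni) mean(Ni * (treepred - fhat)))
sqrt(sum(covi^2))
# compare with the packaged estimate
ranger.pred$se[1]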
Use the Boston dataset to predict medv using the four variables lstat, rm, dis, and crim. Predict a new listing with lstat = 2, rm = 8, dis = 6, and crim = 0.01. It should have a relatively high prediction.
# training the forest
ranger.boston <- ranger(medv ~ lstat + rm + dis + crim, data = Boston,
num.trees = 10000, keep.inbag=TRUE)
testboston = data.frame("lstat" = 2,
"rm" = 8,
"dis" = 6,
"crim" = 0.01)
ranger.pred <- predict(ranger.boston, data = testboston,
type = "se", se.method = "infjack")
## Warning in rInfJack(pred = result$predictions, inbag = inbag.counts, used.trees = 1:num.trees): Sample size <=20, no calibration performed.
c(ranger.pred$predictions - 1.96 * ranger.pred$se,
ranger.pred$predictions + 1.96 * ranger.pred$se)
## [1] 45.21598 49.62977
U-statistics (Hoeffding 1948) provide a framework for the unbiased estimation of population parameters based on independent and identically distributed (i.i.d.) random variables. The general form of a U-statistic for a sample \((X_1, X_2, \ldots, X_n)\) is:
\[ U_n = {{n \choose k}}^{-1} \sum_{1 \leq i_1 < i_2 < \ldots < i_k \leq n} h(X_{i_1}, X_{i_2}, \ldots, X_{i_k}), \]
where we call \(h(\cdot)\) a kernel function, which is an unbiased estimator of some parameter \(\theta\), i.e.,
\[ E[ h(X_{i_1}, X_{i_2}, \ldots, X_{i_k}) ] = \theta. \]
In our case, we can view \(h(\cdot)\) as a tree estimator using some subsamples, and the random forest estimator is the mean of many such estimators. To simplify the notation, let \(S_i\) represent the \(i\)-th set of samples with size \(k\) from the \(n\) observations. Here, \(i\) ranges from 1 to \(\binom{n}{k}\), the total number of such sets. The U-statistic \(U_n\) can then also be written as:
\[ U_n = {\binom{n}{k}}^{-1} \sum_{i} h(S_i) \]
To give a simple example, we start with the U-statistic representation of sample variance, denoted as \(U_n\):
\[\begin{equation} U_n = {{n \choose 2}}^{-1} \sum_{1 \leq i < j \leq n} \frac{(X_i - X_j)^2}{2} \end{equation}\]
To relate this directly to the sample variance, we extend the summation to all pairs \(i\) and \(j\) (which double-counts each pair and only adds zero diagonal terms), and then add and subtract the sample mean \(\bar{X}\) inside the squared term:
\[\begin{align*} U_n &= {{n \choose 2}}^{-1} \frac{1}{4} \sum_{i, j} ( X_i - X_j )^2 \\ &= \frac{1}{2n(n-1)} \sum_{i, j} ( X_i - \bar{X} + \bar{X} - X_j )^2 \\ &= \frac{1}{2n(n-1)} \sum_{i, j} \left\{ (X_i - \bar{X})^2 + (X_j - \bar{X})^2 - 2(X_i - \bar{X})(X_j - \bar{X}) \right\} \\ &= \frac{n}{n(n-1)} \sum_{i=1}^n (X_i - \bar{X})^2 + 0\\ &= \frac{1}{n-1} \sum_{i=1}^n (X_i - \bar{X})^2, \end{align*}\]
which is exactly the unbiased variance estimator.
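As a quick numerical check, the following sketch computes \(U_n\) by averaging \((X_i - X_j)^2/2\) over all pairs \(i < j\) and compares it with the built-in var():

# numerical check: the pairwise U-statistic equals the unbiased sample variance
set.seed(1)
x = rnorm(50)
Un = mean(as.vector(dist(x))^2 / 2)   # (X_i - X_j)^2 / 2 averaged over all pairs i < j
c(Un, var(x))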
A random forest can naturally be viewed as a U-statistic by treating the kernel function \(h(x, S_i)\) as a tree prediction at a target point \(x\), built on \(S_i\), a size-\(k\) subsample of the \(n\) observations. Hence, if we exhaust all possible subsamples, the random forest can be written as
\[ \hat f_\text{RF}(x) = U_n = {\binom{n}{k}}^{-1} \sum_{|S_i| = k} h(x, S_i) \]
Then, the variance of this formulation can be quantified via Hoeffding's decomposition (Hoeffding 1948):
\[ \text{Var}\left(U_{n}\right) = \binom{n}{k}^{-1} \sum_{d=1}^{k} \binom{k}{d} \binom{n-k}{k-d} \xi^2_{d, k}, \]
where \(\xi^2_{d, k}\) is the covariance between two kernels \(h(S_1)\) and \(h(S_2)\) with \(S_1\) and \(S_2\) sharing \(d\) overlapping observations, i.e., \(\xi_{d, k}^2 = \text{Cov} \left( h(S_1), h(S_2) \right)\), with \(|S_1 \cap S_2| = d\). Here both \(S_1\) and \(S_2\) are size-\(k\) subsamples. Alternatively, we can represent \(\xi_{d, k}^2\) as (Lee 2019)
\[ \xi_{d, k}^2 = \text{Var} \left[ E \left (h(S) | X_1, ..., X_d \right) \right]. \]
A key phenomenon we can exploit is that when \(k\) is small relative to \(n\), the first term of Hoeffding's decomposition dominates the entire sum, because the combinatorial weight \(\binom{k}{d} \binom{n-k}{k-d}\) is much larger for \(d = 1\) than for any other \(d\). This means that we only need to estimate \(\xi_{1, k}^2\), the covariance between two tree estimators that share exactly one sample. This approach was illustrated in Mentch and Hooker (2016), who proposed the U-statistic view of random forests. Several difficulties are involved in this formulation: mainly, when \(k\) is large, the first term no longer dominates, and one then needs to estimate many such terms, which is computationally infeasible. Although some corrections have been proposed (Zhou, Mentch, and Hooker 2021), they do not unbiasedly estimate the variance for large \(k\). However, Xu, Zhu, and Shao (2022) fixed the issue by using the conditional variance formula:
\[\begin{align*} \xi_{d, k}^2 =& \text{Var} \left[ E \left (h(S) | X_1, ..., X_d \right) \right] \\ =& \text{Var}(h(S)) - E\left[\text{Var}(h(S) | X_1, ..., X_d) \right], \end{align*}\]
where the first term can be unbiasedly estimated using trees that do not share any samples, and the second term can be estimated using trees that share \(d\) samples. Applying this trick to all terms in Hoeffding's decomposition leads to a simple formula for an unbiased estimator of the variance:
\[ \text{Var}\left(U_{n}\right) = \text{V}^{(h)} - \text{V}^{(s)}, \]
in which the first term is the variance of trees built on independent samples, and the second term is the sample variance of trees whose training sets are subsamples of the same \(n\) observations. Here is an example using the RLT package. One limitation of this approach is that it only allows \(k\) to be as large as half of \(n\).
# generate the same dataset as previous example
set.seed(1)
ntrain = 600
ntest = 100
n = ntrain + ntest
p = 30
x = matrix(rnorm(n*p), n, p)
b = c(0.5, 0, 0.5, rep(0, p-3))
y = x %*% b + rnorm(n)
traindata = data.frame(x, y)[1:ntrain, ]
testdata = data.frame(x, y)[-(1:ntrain), ]
library(RLT)
## RLT and Random Forests v4.2.6
## pre-release at github.com/teazrq/RLT
RLT.fit <- RLT(traindata[, 1:p], traindata$y, model = "regression",
ntrees = 20000, mtry = 30, nmin = 5,
resample.replace = FALSE, resample.prob = 0.5,
param.control = list("var.ready" = TRUE,
"resample.track" = TRUE),
verbose = TRUE)
## Regression Random Forest ...
## ---------- Parameters Summary ----------
## (N, P) = (600, 30)
## # of trees = 20000
## (mtry, nmin) = (30, 5)
## split generate = Random, 1
## sampling = 0.5 w/o replace
## (Obs, Var) weights = (No, No)
## importance = none
## reinforcement = No
## ----------------------------------------
RLT.Pred <- predict(RLT.fit, testdata, var.est = TRUE)
# coverage
lowerbound = RLT.Pred$Prediction - 1.96*sqrt(RLT.Pred$Variance)
upperbound = RLT.Pred$Prediction + 1.96*sqrt(RLT.Pred$Variance)
cover = (testlink <= upperbound) & (testlink >= lowerbound)
# plot results
plot(testlink, RLT.Pred$Prediction,
xlim = c(-2, 2), ylim = c(-2, 2),
pch = 19, col = ifelse(cover, "green", "red"),
xlab = "True Mean", ylab = "Estimated Mean and 95% CI")
lines(x = c(-2, 2), y = c(-2, 2))
# the confidence interval
for (i in 1:nrow(testdata))
lines(x = c(testlink[i], testlink[i]),
y = c(lowerbound[i], upperbound[i]))
## average standard error, typically smaller than the Jackknife-based estimators
## you may occasionally get negative variance estimates...
mean(sqrt(RLT.Pred$Variance), na.rm = TRUE)
## [1] 0.1665776
Use the Boston dataset to predict medv using the four variables lstat, rm, dis, and crim (columns 1, 6, 8, and 13). Predict a new listing with lstat = 2, rm = 8, dis = 6, and crim = 0.01. It should have a relatively high prediction.
data("Boston", package = "MASS")
# training the forest
library(RLT)
RLT.boston <- RLT(x = Boston[, c(1, 6, 8, 13)], y = Boston$medv,
ntrees = 10000, mtry = 4, nmin = 5,
resample.replace = FALSE, resample.prob = 0.5,
param.control = list("var.ready" = TRUE,
"resample.track" = TRUE),
verbose = TRUE)
## Regression Random Forest ...
## ---------- Parameters Summary ----------
## (N, P) = (506, 4)
## # of trees = 10000
## (mtry, nmin) = (4, 5)
## split generate = Random, 1
## sampling = 0.5 w/o replace
## (Obs, Var) weights = (No, No)
## importance = none
## reinforcement = No
## ----------------------------------------
testboston = data.frame("lstat" = 2,
"rm" = 8,
"dis" = 6,
"crim" = 0.01)
RLT.Pred <- predict(RLT.boston, testboston, var.est = TRUE)
# confidence interval
lowerbound = RLT.Pred$Prediction - 1.96*sqrt(RLT.Pred$Variance)
upperbound = RLT.Pred$Prediction + 1.96*sqrt(RLT.Pred$Variance)
c(lowerbound, upperbound)
## [1] 43.22363 46.68151