The asymptotic normality of random forests is still an active research topic and hinges on how trees are built, e.g., the splitting rule, honesty, subsampling versus bootstrapping, and other tuning parameters (see, e.g., Mentch and Hooker (2016); Athey, Tibshirani, and Wager (2019)). In our previous lecture, we showed that using influence-function–based splits together with local estimation (honest or sample-splitting construction) yields an estimator that is asymptotically linear and hence asymptotically normal under standard regularity conditions. Accepting asymptotic normality as a working premise, the key practical problem becomes variance estimation, which enables pointwise confidence intervals for predictions and other functionals (e.g., variable importance). However, plugging into the asymptotic variance formula directly can perform poorly in finite samples. Therefore, this note focuses on variance estimation based on resampling approaches. We introduce four main tools for obtaining an estimate of the variance of random forests:

1. the bootstrap, including the 0.164 double bootstrap for variable importance;
2. the Jackknife and its delete-\(d\) generalization based on subsampling;
3. the Jackknife-after-Bootstrap and Infinitesimal Jackknife, which reuse the trees of a single fitted forest; and
4. a U-statistic-based variance estimator built from subsampled trees.
Note that all of these approaches can be applied to various quantities of interest in random forests, and even beyond random forests.
It is well known that the bootstrap approach can be used to approximate a sampling distribution. Mathematically, if we denote the observed data as \(X_1, X_2, \ldots, X_n\), drawn independently from an underlying distribution \(\mathbf{P}\), then the empirical distribution of this sample is defined as
\[ \mathbf{P}_n = \frac{1}{n} \sum_{i = 1}^n \delta_{X_i}, \]
where \(\delta_{X_i}\) places a point mass at \(X_i\). The following example illustrates the empirical estimation of the c.d.f. of a univariate variable.
# obtain some samples from normal
n = 50
x = rnorm(n)
# calculate the empirical cdf using apply
u = seq(-3.5, 3.5, length.out = 200)
mycdf <- function(u, X)
{
  cdfe = rep(0, length(u))
  for (i in 1:length(u))
    cdfe[i] = mean(X <= u[i])
  return(cdfe)
}
Fn = mycdf(u, x)
# calculate the true cdf
Ftrue = pnorm(u)
# plot the estimated cdf
plot(u, Ftrue, type = "l", xlab = "x", ylab = "F(x)",
lwd = 2, col = "deepskyblue", main = "Empirical CDF")
lines(u, Fn, col = "darkorange", lwd = 3)
legend("topleft", legend = c("Empirical CDF", "True CDF"),
col = c("darkorange", "deepskyblue"), lwd = 2)
# the x bar and confidence interval of x bar
mean(x)
## [1] 0.03336147
c(mean(x) - 1.96*sd(x)/sqrt(n), mean(x) + 1.96*sd(x)/sqrt(n))
## [1] -0.2412370 0.3079599
Since the empirical distribution \(\mathbf{P}_n\) converges to the true distribution \(\mathbf{P}\) as \(n\) grows, random draws with replacement (i.e., bootstrap samples) from \(\mathbf{P}_n\) can be viewed as approximate draws from \(\mathbf{P}\). This leads to two important consequences: first, a statistic computed on a bootstrap sample behaves approximately like the same statistic computed on a fresh sample from \(\mathbf{P}\); second, by repeating the resampling many times, the spread of the bootstrapped statistics (e.g., their standard deviation or quantiles) approximates the sampling variability of the original estimator.
Below is an example that estimates the variance of the sample mean \(\bar{x}\) using bootstrap samples.
# plot the estimated cdf
plot(u, Ftrue, type = "l", xlab = "x", ylab = "F(x)", lwd = 2, col = "deepskyblue")
lines(u, Fn, col = "darkorange", lwd = 3)
# bootstrapped means
nbs = n
xbarb = rep(0, nbs)
for (i in 1:nbs)
{
  xb = sample(x, replace = TRUE)
  points(u, mycdf(u, xb), type = "s", col = "gray")
  xbarb[i] = mean(xb)
}
# bootstrap estimation of 90% confidence interval of xbar
quantile(xbarb, c(0.05, 0.95))
## 5% 95%
## -0.1826668 0.2774558
The bootstrap idea can also be applied to random forests. By fitting random forests to multiple bootstrap samples, we can approximate the sampling distribution of the random forest estimator \(\hat{f}_\text{RF}(x)\) itself1. A naive bootstrap procedure may work as follows:

1. Draw a bootstrap sample of the \(n\) observations.
2. Fit a random forest to the bootstrap sample and record the quantity of interest (e.g., the prediction at a target point or a variable importance measure).
3. Repeat \(B\) times and use the empirical variance (or quantiles) of the \(B\) recorded values.
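To make this concrete, here is a minimal sketch of the naive bootstrap for the prediction at a single target point, using the ranger package; the simulated data, the target point x0, and the number of replicates B are our own illustrative choices rather than part of the original example.

# a minimal sketch of the naive bootstrap: refit the forest on B bootstrap samples
# and use the spread of the resulting predictions at a target point x0
library(ranger)
set.seed(1)
n = 200
p = 5
x = matrix(rnorm(n*p), n, p)
y = x[, 1] + rnorm(n)
dat = data.frame(x, y)
x0 = data.frame(matrix(0, 1, p))
colnames(x0) = colnames(dat)[1:p]
B = 100
pred.b = rep(0, B)
for (b in 1:B)
{
  # step 1: draw a bootstrap sample of the observations
  id = sample(1:n, n, replace = TRUE)
  # step 2: refit the forest and record the prediction at x0
  fit.b = ranger(y ~ ., data = dat[id, ], num.trees = 200)
  pred.b[b] = predict(fit.b, data = x0)$predictions
}
# step 3: the bootstrap variance estimate and a percentile interval
var(pred.b)
quantile(pred.b, c(0.025, 0.975))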
However, this approach can be problematic for certain statistics, particularly variable importance. The main issue arises because bootstrap samples contain replicated observations. Although each tree’s out-of-bag (OOB) set is disjoint from its own in-bag data, the reuse of the same observations across many trees leads to dependence between in-bag and OOB predictions at the forest level. This dependence can introduce bias in the variable importance calculation, especially when the OOB predictions are used as an internal validation set.
To mitigate this problem, Ishwaran and Lu (2019) proposed modified bootstrap procedures for variable importance. One such approach, often referred to as the 0.164 bootstrap estimator, further removes any OOB observations that were already selected in-bag for the same tree, thereby enforcing stronger independence between training and testing samples within each tree. Under this scheme, the effective OOB size is about \(0.164n\) on average, hence the name. This method is sometimes described as a form of double bootstrap and is applicable to general random forest models. Below is an example using the randomForestSRC package. In this code, we simulate data with \(p = 20\) predictors, but only the first and the third variables are truly useful.
set.seed(1)
n = 300
p = 20
x = matrix(rnorm(n*p), n, p)
y = x %*% c(1, 0, 1, rep(0, p-3)) + rnorm(n)
library(randomForestSRC)
# training the forest
reg.o <- rfsrc(y ~ ., data.frame(x, y))
# the .164 double bootstrap approach
reg.dbs.o <- subsample(reg.o, B = 100, bootstrap = TRUE, verbose = FALSE)
plot.subsample(reg.dbs.o)
The Jackknife is another classical approach for estimating the variance of an estimator. Its main idea is simple: remove one observation at a time and measure how much the estimator changes. The Jackknife estimator is closely connected to the influence function we introduced previously, which quantifies the sensitivity of an estimator to a small perturbation of the data. In fact, the Jackknife can be viewed as a discrete, empirical approximation to the influence function. By aggregating the squared effects of these leave-one-out perturbations, we obtain an estimate of the variance of the estimator. Formally, the Jackknife variance estimator is given by
\[ \hat{\mathrm{V}}_\mathrm{J}(\hat{\theta}_\ast) = \frac{n-1}{n} \sum_{i=1}^{n} \left( \hat{\theta}_{(-i)} - \hat{\theta}_\ast \right)^2, \]
where \(\hat{\theta}_\ast\) is the average of the leave-one-out estimates,
\[ \hat{\theta}_\ast = \frac{1}{n} \sum_{i=1}^{n} \hat{\theta}_{(-i)}, \]
and \(\hat{\theta}_{(-i)}\) is the estimator computed with the \(i\)-th observation removed.
For the sample mean, you can verify that \(\hat{\theta}_\ast = \bar{x}\), and the Jackknife variance estimator coincides exactly with the theoretical variance of \(\bar{x}\).
# obtain some samples from normal
n = 100
x = rnorm(n)
# calculating the influence function
influence = rep(0, length(x))
xbar = mean(x) # theta_*
for (i in 1:length(x))
{
  xbar_no_i = mean(x[-i])
  influence[i] = xbar_no_i - xbar
}
# jackknife variance estimation
var_est = sum(influence^2) * (length(x) - 1) / (length(x))
# compare with the theoretical estimation
var_est
## [1] 0.009547256
var(x) / length(x)
## [1] 0.009547256
Interestingly (or perhaps not!), the Jackknife variance estimator coincides with the usual variance estimate whenever \(\hat{\theta}\) is a linear function of the observations, as is the case for \(\hat{\theta} = \bar{x}\). In such cases, the total variance of the estimator is exactly the sum of the contributions from each observation, since each data point affects the estimate linearly and independently. This connection illustrates why the Jackknife works so well for simple estimators like the mean, and why it often provides accurate approximations for more complex, approximately linear estimators. We will revisit this logic later when constructing confidence intervals for random forest predictions. For now, however, let’s return to the application of variable importance, extending the idea slightly beyond this simple linear case.
Since the OOB data contain only a small fraction of the total sample, the variable importance evaluation based on bootstrap samples can be inefficient and unstable. To obtain a reliable variance estimate, one may need to perform a large number of bootstrap replications. An alternative is to use subsamples of size \(n - d\) (without replacement) instead of bootstrap samples when fitting each replicate of the random forest, and then approximate the variance using the deviations of these subsample-based estimators from the full-sample estimate.
In this setup, each subsample of size \(n - d\) contains no duplicated observations, and a standard random forest can be fit on each subsample. Because each tree within a subsample still uses a fraction of the data for training, the expected number of out-of-bag observations per tree is roughly \((1 - 0.632)(n - d)\). The delete-\(d\) Jackknife estimator (Shao and Wu 1989) is a generalization of the traditional Jackknife and is defined as
\[ \hat{\mathrm{V}}_{\mathrm{J(d)}}(\hat{\theta}_n) = \frac{n - d}{d} {\binom{n}{n - d}}^{-1} \sum_{|S| = n - d} \big( \hat{\theta}_{S} - \bar{\theta}_{n, d} \big)^2, \]
where \(\hat{\theta}_{S}\) is the estimator fitted on the subsample \(S\) of size \(n - d\), and \(\bar{\theta}_{n, d}\) is the average of all such delete-\(d\) estimators, analogous to the standard Jackknife case. In practice, \(\bar{\theta}_{n, d}\) is often well approximated by the estimator fitted on the full dataset, and the scaling factor \(\frac{n - d}{d}\) adjusts for the reduced sample size in each subsample. Since enumerating all \(\binom{n}{n - d}\) possible subsets of size \(n - d\) is computationally infeasible, we typically use a finite number of random subsamples to approximate the average.
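As a quick illustration of this subsample approximation, here is a minimal sketch that applies the delete-\(d\) Jackknife to the sample mean, replacing the sum over all \(\binom{n}{n-d}\) subsets with \(B\) random subsamples; the choices of \(n\), \(d\), and \(B\) below are ours and only for illustration.

# delete-d jackknife for the sample mean, approximated with B random subsamples
set.seed(1)
n = 100
d = n / 2
x = rnorm(n)
B = 1000
theta_sub = rep(0, B)
for (b in 1:B)
{
  # estimator fitted on a random subsample of size n - d (without replacement)
  theta_sub[b] = mean(x[sample(1:n, n - d)])
}
# center at the full-sample estimate and rescale
var_d = (n - d) / d * mean((theta_sub - mean(x))^2)
# compare with the usual variance estimate of the sample mean
c(var_d, var(x) / n)

The following code uses the Boston housing data to fit a random forest, estimates the variance of the variable importance measures, selects a smaller set of significant variables, and refits the model with the selected variables.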
data("Boston", package = "MASS")
# training the forest
reg.boston <- rfsrc(medv ~ ., Boston, ntree = 1000, mtry = 13, nodesize = 5, nsplit = 5)
tail(reg.boston$err.rate, 1)
## [1] 12.21841
# Delete-d jackknife estimator
reg.smp.boston <- subsample(reg.boston, B = 200, bootstrap = FALSE, verbose = FALSE)
plot.subsample(reg.smp.boston)
# refit the model
reg.boston <- rfsrc(medv ~ lstat + rm + dis + crim + nox + rad + tax, data = Boston,
ntree = 1000, mtry = 6, nodesize = 6, nsplit = 5)
tail(reg.boston$err.rate, 1)
## [1] 12.61175
While the previous approaches can estimate the variance of various quantities in random forests, they typically require refitting the model many times, which can be computationally expensive. However, if we look more closely inside a random forest, it already generates many variations of subsamples or bootstrap samples when constructing individual trees. Although we cannot simply resample the trees after fitting to estimate the variance, we can exploit this built-in resampling structure in a more efficient way. In this section, we focus on estimating the variance of predictions at a new target point \(x_0\), following the Jackknife procedure introduced in Wager, Hastie, and Efron (2014).
The key idea is to adapt the Jackknife variance formula to the prediction setting, rather than to variable importance or other derived statistics. Specifically, if we view the random forest \(\hat{f}_\mathrm{RF}(x_0)\) as an estimator of the conditional mean at \(x_0\), then its Jackknife variance can be written as
\[ \hat{\mathrm{V}}_\mathrm{J}\!\big(\hat f_\mathrm{RF}(x_0)\big) = \frac{n - 1}{n} \sum_{i = 1}^{n} \left( \hat f_{(-i)}(x_0) - \hat f_\mathrm{RF}(x_0) \right)^2, \]
where \(\hat f_\mathrm{RF}(x_0)\) is the random forest fitted using all data, and each \(\hat f_{(-i)}(x_0)\) denotes the prediction from a forest trained with observation \(i\) removed. In practice, however, refitting \(n\) forests is infeasible. To address this, we note that a standard random forest is already an average of \(B\) trees, each trained on a bootstrap sample of the data:
\[ \hat f_\mathrm{RF}(x_0) = \frac{1}{B} \sum_{b=1}^{B} \hat f_b(x_0). \]
Since each bootstrap sample omits about \(36.8\%\) of the observations2, many trees are naturally trained without a given data point \(X_i\). We can therefore approximate \(\hat f_{(-i)}(x_0)\) by averaging the predictions from only those trees that did not include \(X_i\) in their bootstrap sample. This idea leads to the Jackknife-after-Bootstrap estimator proposed by Efron (1992), which efficiently estimates the prediction variance using just one fitted random forest. This approach is implemented in the ranger package via the option se.method = "jack".
set.seed(1)
ntrain = 600
ntest = 100
n = ntrain + ntest
p = 30
x = matrix(rnorm(n*p), n, p)
b = c(0.5, 0, 0.5, rep(0, p-3))
y = x %*% b + rnorm(n)
traindata = data.frame(x, y)[1:ntrain, ]
testdata = data.frame(x, y)[-(1:ntrain), ]
# calculate confidence interval
library(ranger)
ranger.fit <- ranger(y~., data = traindata, num.trees = 5000,
keep.inbag=TRUE, mtry = 30, min.node.size = 10)
# Infinitesimal Jackknife (infjack) or jackknife-after-bootstrap (jack)
ranger.pred <- predict(ranger.fit, data = testdata,
type = "se", se.method = "jack")
# the true link function
testlink = data.matrix(testdata[, 1:p]) %*% b
# coverage
lowerbound = ranger.pred$predictions - 1.96*ranger.pred$se
upperbound = ranger.pred$predictions + 1.96*ranger.pred$se
cover = (testlink <= upperbound) & (testlink >= lowerbound)
# plot results
plot(testlink, ranger.pred$predictions,
xlim = c(-2, 2), ylim = c(-2, 2),
pch = 19, col = ifelse(cover, "green", "red"),
xlab = "True Mean", ylab = "Estimated Mean and 95% CI")
lines(x = c(-2, 2), y = c(-2, 2))
# the confidence interval
for (i in 1:nrow(testdata))
lines(x = c(testlink[i], testlink[i]),
y = c(lowerbound[i], upperbound[i]))
## average standard error of the predictions
mean(ranger.pred$se)
## [1] 0.311566
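To see roughly what se.method = "jack" computes, here is a manual sketch of the Jackknife-after-Bootstrap estimate for the first test point, based on the in-bag counts stored in ranger.fit. The object names below (inbag, alltree.pred, fbar, loo.pred) are our own, and ranger applies an additional Monte Carlo bias correction, so this raw value will not match its reported standard error exactly.

# in-bag counts: one row per training observation, one column per tree
inbag = matrix(unlist(ranger.fit$inbag.counts), nrow = ntrain, ncol = ranger.fit$num.trees)
# per-tree predictions at the first test point
alltree.pred = as.vector(predict(ranger.fit, data = testdata[1, ],
                                 predict.all = TRUE)$predictions)
fbar = mean(alltree.pred)
# leave-one-out predictions: average over the trees that never used observation i
loo.pred = sapply(1:ntrain, function(i) mean(alltree.pred[inbag[i, ] == 0]))
# raw (uncorrected) jackknife-after-bootstrap variance estimate
VJ = (ntrain - 1) / ntrain * sum((loo.pred - fbar)^2)
# compare with the packaged, bias-corrected value ranger.pred$se[1]
sqrt(VJ)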
A slight variation of the Jackknife-after-Bootstrap approach is the Infinitesimal Jackknife (IJ) estimator (Efron 2014), which quantifies how the prediction changes with infinitesimal perturbations to the sampling weights of each observation. Intuitively, it measures the correlation between how often each observation is used across trees and how the forest prediction changes as a result. The key motivation is that deleting a single observation introduces additional randomness—since each tree still depends on many other samples. Instead of explicitly removing observations, the IJ approximates the effect of a small up- or down-weighting of each training point \(X_i\) by computing the empirical covariance between the number of times \(X_i\) appears in each tree and the tree’s prediction at the target point \(x_0\).
The IJ variance estimator can be written as
\[ \hat{\mathrm{V}}_\mathrm{IJ}\!\big(\hat f_\mathrm{RF}(x_0)\big) = \sum_{i=1}^{n} \widehat{\mathrm{Cov}}_i^2 = \sum_{i=1}^{n} \left[ \frac{1}{B} \sum_{b=1}^{B} (N_{bi} - 1) \big(\hat f_b(x_0) - \hat f_\mathrm{RF}(x_0)\big) \right]^2, \]
where \(N_{bi}\) is the number of times observation \(i\) appears in the bootstrap sample for the \(b\)-th tree, and \(\hat f_b(x_0)\) is the prediction from that tree at the target point \(x_0\). This formulation was later extended to random forests by Wager, Hastie, and Efron (2014).
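For comparison, the displayed formula can be evaluated directly from the same quantities (inbag, alltree.pred, and fbar from the sketch above). Again, this is the raw estimate without the finite-\(B\) bias correction used in practice, so it will differ somewhat from the packaged output.

# empirical covariance between in-bag counts and tree predictions, for each observation
dev = matrix(alltree.pred - fbar, nrow = ntrain, ncol = ranger.fit$num.trees, byrow = TRUE)
covi = rowMeans((inbag - 1) * dev)
# raw (uncorrected) infinitesimal jackknife variance estimate at the first test point
VIJ = sum(covi^2)
sqrt(VIJ)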
In practice, both the Jackknife-after-Bootstrap and Infinitesimal Jackknife estimators can exhibit upward bias in finite samples. The bias decreases as the number of trees \(B\) increases (hence, using a large forest is recommended), and several bias-correction methods have been proposed (see also Sexton and Laake (2009)). Nevertheless, Jackknife-based estimators often overestimate the variance, particularly when the estimator is not approximately linear or smooth. The following code performs prediction and constructs confidence intervals using the Infinitesimal Jackknife implemented in the ranger package.
# Infinitesimal Jackknife (infjack) or jackknife-after-bootstrap (jack)
ranger.pred <- predict(ranger.fit, data = testdata,
type = "se", se.method = "infjack")
# the true link function
testlink = data.matrix(testdata[, 1:p]) %*% b
# coverage
lowerbound = ranger.pred$predictions - 1.96*ranger.pred$se
upperbound = ranger.pred$predictions + 1.96*ranger.pred$se
cover = (testlink <= upperbound) & (testlink >= lowerbound)
# plot results
plot(testlink, ranger.pred$predictions,
xlim = c(-2, 2), ylim = c(-2, 2),
pch = 19, col = ifelse(cover, "green", "red"),
xlab = "True Mean", ylab = "Estimated Mean and 95% CI")
lines(x = c(-2, 2), y = c(-2, 2))
# the confidence interval
for (i in 1:nrow(testdata))
lines(x = c(testlink[i], testlink[i]),
y = c(lowerbound[i], upperbound[i]))
## average standard error of the predictions
mean(ranger.pred$se)
## [1] 0.2509218
Practice Example: Use the Boston dataset to predict medv using four variables: lstat, rm, dis, and crim. Predict a new listing with lstat = 2, rm = 8, dis = 6, and crim = 0.01. It should have a relatively high predicted value.
# training the forest
ranger.boston <- ranger(medv ~ lstat + rm + dis + crim, data = Boston,
num.trees = 10000, keep.inbag=TRUE)
testboston = data.frame("lstat" = 2,
"rm" = 8,
"dis" = 6,
"crim" = 0.01)
ranger.pred <- predict(ranger.boston, data = testboston,
type = "se", se.method = "infjack")
## Warning in rInfJack(pred = result$predictions, inbag = inbag.counts, used.trees = 1:num.trees): Sample size <=20, no calibration performed.
c(ranger.pred$predictions - 1.96 * ranger.pred$se,
ranger.pred$predictions + 1.96 * ranger.pred$se)
## [1] 45.21598 49.62977
U-statistics (Hoeffding 1948) provide a general framework for constructing unbiased estimators of population parameters, particularly for independent and identically distributed (i.i.d.) random variables. The general form of a U-statistic for a sample \((X_1, X_2, \ldots, X_n)\) is
\[ U_n = \binom{n}{k}^{-1} \sum_{1 \leq i_1 < i_2 < \cdots < i_k \leq n} h(X_{i_1}, X_{i_2}, \ldots, X_{i_k}), \]
where \(h(\cdot)\) is called the kernel function, and it is chosen so that \(h(X_{i_1}, \ldots, X_{i_k})\) is an unbiased estimator of some parameter \(\theta\):
\[ E[h(X_{i_1}, X_{i_2}, \ldots, X_{i_k})] = \theta. \]
In the context of random forests, we can view \(h(\cdot)\) as the tree estimator built on a subsample of size \(k\). Then, the random forest estimator can be regarded as the mean of many such kernel evaluations. To simplify notation, let \(S_i\) denote the \(i\)-th subset of the data with size \(k\) drawn from the \(n\) observations. There are \(\binom{n}{k}\) such subsets, and the U-statistic can be written equivalently as
\[ U_n = \binom{n}{k}^{-1} \sum_{i} h(S_i). \]
As an example, the sample variance can be expressed as a U-statistic with kernel function \(h(X_i, X_j) = \frac{(X_i - X_j)^2}{2}\) and \(k = 2\)3.
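A quick numerical check of this claim (the sample size of 30 below is arbitrary):

# average the kernel h(Xi, Xj) = (Xi - Xj)^2 / 2 over all pairs i < j
set.seed(1)
x = rnorm(30)
h = outer(x, x, function(a, b) (a - b)^2 / 2)
Un = sum(h[upper.tri(h)]) / choose(30, 2)
# this matches the unbiased sample variance var(x)
c(Un, var(x))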
We can now extend this framework to random forests by treating each tree prediction as a kernel function \(h(S_i)\) based on a subsample \(S_i\) of size \(k\). If we were to average over all possible subsamples, the random forest estimator could be written as
\[ \hat f_\mathrm{RF}(x) = U_n = \binom{n}{k}^{-1} \sum_{|S_i| = k} h(x, S_i). \]
The variance of this U-statistic can then be decomposed using Hoeffding’s decomposition (Hoeffding 1948):
\[ \mathrm{Var}(U_n) = \binom{n}{k}^{-1} \sum_{d=1}^{k} \binom{k}{d} \binom{n-k}{k-d} \xi_{d,k}^2, \]
where \(\xi_{d,k}^2\) denotes the covariance between two kernel estimates \(h(S_1)\) and \(h(S_2)\)
whose subsamples share \(d\) overlapping observations, i.e.,
\(\xi_{d,k}^2 = \mathrm{Cov}(h(S_1), h(S_2))\) when \(|S_1 \cap S_2| = d\).
Equivalently, we can write (Lee 2019):
\[ \xi_{d,k}^2 = \mathrm{Var}\!\left[ E\big( h(S) \mid X_1, \ldots, X_d \big) \right]. \]
A key insight is that when \(k \ll n\), the first term of the decomposition (i.e., \(d = 1\)) dominates the sum because the combinatorial coefficient \(\binom{k}{d} \binom{n-k}{k-d}\) rapidly decreases as \(d\) grows. Hence, the overall variance is mainly driven by \(\xi_{1,k}^2\), the covariance between two tree estimators that share exactly one observation. This idea was formalized by Mentch and Hooker (2016), who established a U-statistic perspective on random forests.
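To see this numerically, we can compute the combinatorial weights \(\binom{k}{d}\binom{n-k}{k-d} \big/ \binom{n}{k}\) for a moderate setting, say \(n = 1000\) and \(k = 20\); these values are chosen purely for illustration.

# relative weight of each overlap size d in Hoeffding's decomposition
n = 1000
k = 20
d = 1:k
w = choose(k, d) * choose(n - k, k - d) / choose(n, k)
# the d = 1 term carries most of the weight among the terms with d >= 1
round(w / sum(w), 4)[1:5]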
In practice, several difficulties arise. When \(k\) is large, the first term no longer dominates, and estimating all \(\xi_{d,k}^2\) terms becomes computationally infeasible. Although some bias-corrected approximations have been proposed (e.g., Zhou, Mentch, and Hooker 2021), they are not fully unbiased when \(k\) is large. Recently, Xu, Zhu, and Shao (2024) addressed this issue using the conditional variance identity:
\[ \begin{aligned} \xi_{d,k}^2 &= \mathrm{Var}\!\left[ E\big( h(S) \mid X_1, \ldots, X_d \big) \right] \\ &= \mathrm{Var}(h(S)) - E\!\left[\mathrm{Var}\big(h(S) \mid X_1, \ldots, X_d\big)\right], \end{aligned} \]
where the first term can be estimated using independent trees (trained on disjoint samples), and the second term can be estimated using trees that share \(d\) common observations. By applying this trick across all terms in Hoeffding’s decomposition, we obtain a simple and unbiased estimator of the variance:
\[ \mathrm{Var}(U_n) = V^{(h)} - V^{(s)}, \]
where \(V^{(h)}\) represents the variance of tree predictions based on independent subsamples, and \(V^{(s)}\) represents the variance of predictions from trees whose subsamples overlap within the same dataset. The following example, implemented using the RLT package, illustrates this approach. A practical limitation is that this method is currently restricted to cases where \(k \leq n/2\).
# generate the same dataset as previous example
set.seed(1)
ntrain = 600
ntest = 100
n = ntrain + ntest
p = 30
x = matrix(rnorm(n*p), n, p)
b = c(0.5, 0, 0.5, rep(0, p-3))
y = x %*% b + rnorm(n)
traindata = data.frame(x, y)[1:ntrain, ]
testdata = data.frame(x, y)[-(1:ntrain), ]
library(RLT)
## RLT and Random Forests v4.2.6
## pre-release at github.com/teazrq/RLT
RLT.fit <- RLT(traindata[, 1:p], traindata$y, model = "regression",
ntrees = 20000, mtry = 30, nmin = 5,
resample.replace = FALSE, resample.prob = 0.5,
param.control = list("var.ready" = TRUE,
"resample.track" = TRUE),
verbose = TRUE)
## Regression Random Forest ...
## ---------- Parameters Summary ----------
## (N, P) = (600, 30)
## # of trees = 20000
## (mtry, nmin) = (30, 5)
## split generate = Random, 1
## sampling = 0.5 w/o replace
## (Obs, Var) weights = (No, No)
## importance = none
## reinforcement = No
## ----------------------------------------
RLT.Pred <- predict(RLT.fit, testdata, var.est = TRUE)
# coverage
lowerbound = RLT.Pred$Prediction - 1.96*sqrt(RLT.Pred$Variance)
## Warning in sqrt(RLT.Pred$Variance): NaNs produced
upperbound = RLT.Pred$Prediction + 1.96*sqrt(RLT.Pred$Variance)
## Warning in sqrt(RLT.Pred$Variance): NaNs produced
cover = (testlink <= upperbound) & (testlink >= lowerbound)
# plot results
plot(testlink, RLT.Pred$Prediction,
xlim = c(-2, 2), ylim = c(-2, 2),
pch = 19, col = ifelse(cover, "green", "red"),
xlab = "True Mean", ylab = "Estimated Mean and 95% CI")
lines(x = c(-2, 2), y = c(-2, 2))
# the confidence interval
for (i in 1:nrow(testdata))
lines(x = c(testlink[i], testlink[i]),
y = c(lowerbound[i], upperbound[i]))
## the average standard error is typically smaller than the Jackknife-based estimates
## occasionally the variance estimate can be negative (hence the NaN warnings)
mean(sqrt(RLT.Pred$Variance), na.rm = TRUE)
## Warning in sqrt(RLT.Pred$Variance): NaNs produced
## [1] 0.159926
Practice Example: Use the Boston dataset to predict medv using four variables: lstat, rm, dis, and crim (columns 1, 6, 8, and 13). Predict a new listing with lstat = 2, rm = 8, dis = 6, and crim = 0.01. It should have a relatively high predicted value.
data("Boston", package = "MASS")
# training the forest
library(RLT)
RLT.boston <- RLT(x = Boston[, c(1, 6, 8, 13)], y = Boston$medv,
ntrees = 10000, mtry = 4, nmin = 5,
resample.replace = FALSE, resample.prob = 0.5,
param.control = list("var.ready" = TRUE,
"resample.track" = TRUE),
verbose = TRUE)
## Regression Random Forest ...
## ---------- Parameters Summary ----------
## (N, P) = (506, 4)
## # of trees = 10000
## (mtry, nmin) = (4, 5)
## split generate = Random, 1
## sampling = 0.5 w/o replace
## (Obs, Var) weights = (No, No)
## importance = none
## reinforcement = No
## ----------------------------------------
testboston = data.frame("lstat" = 2,
"rm" = 8,
"dis" = 6,
"crim" = 0.01)
RLT.Pred <- predict(RLT.boston, testboston, var.est = TRUE)
# confidence interval
lowerbound = RLT.Pred$Prediction - 1.96*sqrt(RLT.Pred$Variance)
upperbound = RLT.Pred$Prediction + 1.96*sqrt(RLT.Pred$Variance)
c(lowerbound, upperbound)
## [1] 41.96696 47.95851
What’s wrong with taking bootstrapped trees from a fitted random forest and using their distribution of variable importance? Well, that would estimate the variance of a single tree estimator, not the variance of the random forest, which averages over many trees.↩︎
Each bootstrap sample is drawn with replacement from \(n\) observations. The probability that a specific observation is not selected in one draw is \(1 - 1/n\). After \(n\) draws, the probability that it is never selected is \((1 - 1/n)^n \approx e^{-1} \approx 0.368\). Thus, on average, about 36.8% of the data points are omitted in each bootstrap sample.↩︎
As a simple illustration, consider the U-statistic representation of the sample variance: \[ U_n = \binom{n}{2}^{-1} \sum_{1 \leq i < j \leq n} \frac{(X_i - X_j)^2}{2}. \] To show that this expression is equivalent to the unbiased sample variance, we expand the pairwise terms and insert the sample mean \(\bar{X}\) into the difference: \[ \begin{aligned} U_n &= \binom{n}{2}^{-1} \frac{1}{4} \sum_{i,j} (X_i - X_j)^2 \\ &= \frac{1}{2n(n-1)} \sum_{i,j} (X_i - \bar{X} + \bar{X} - X_j)^2 \\ &= \frac{1}{2n(n-1)} \sum_{i,j} \left\{ (X_i - \bar{X})^2 + (X_j - \bar{X})^2 - 2(X_i - \bar{X})(X_j - \bar{X}) \right\} \\ &= \frac{n}{n(n-1)} \sum_{i=1}^n (X_i - \bar{X})^2 + 0 \\ &= \frac{1}{n-1} \sum_{i=1}^n (X_i - \bar{X})^2, \end{aligned} \] where the cross term vanishes because \(\sum_{i} (X_i - \bar{X}) = 0\), and the result is exactly the unbiased estimator of the population variance.↩︎