The asymptotic normality of random forests remains an active area of research. Its validity may depend on various assumptions concerning the construction of the splitting rule and the associated tuning parameters (see, e.g., Mentch and Hooker (2016) and Athey, Tibshirani, and Wager (2019)). If one accepts that random forests are asymptotically normal, then accurate estimation of their variance enables the construction of confidence intervals for predictions and other associated quantities. This note therefore focuses on variance estimation for random forests. We introduce three main tools for estimating the variance of random forest quantities: the bootstrap, the jackknife, and U-statistics.
Note that all of these approaches can be applied to various quantities of interest in a random forest, and even beyond random forests, but we illustrate each of them with a different task.
It is well known that the bootstrap can be used to approximate the sampling distribution of an estimator. Mathematically, if we denote the observed data as \(X_1, X_2, \ldots, X_n\), drawn independently from an underlying distribution \(\mathbf{P}\), then the empirical distribution of this sample is \(\mathbf{P}_n = n^{-1} \sum_{i = 1}^n \delta(X_i)\), where \(\delta(X_i)\) denotes a point mass placed at \(X_i\). The following is an example of the empirical estimation of the c.d.f. of a univariate variable.
# obtain some samples from normal
n = 100
x = rnorm(n)
# calculate the empirical cdf using apply
u = seq(-3.5, 3.5, length.out = 200)
mycdf <- function(u, X)
{
  cdfe = rep(0, length(u))
  for (i in 1:length(u))
    cdfe[i] = mean(X <= u[i])
  return(cdfe)
}
Fn = mycdf(u, x)
# plot the estimated cdf
plot(u, Fn, type = "l", xlab = "x", ylab = "F(x)")
# the x bar and confidence interval of x bar
mean(x)
## [1] -0.01861898
c(mean(x) - 1.96*sd(x)/sqrt(n), mean(x) + 1.96*sd(x)/sqrt(n))
## [1] -0.2206543 0.1834164
Since we can expect the empirical distribution \(\mathbf{P}_n\) to converge to the true distribution \(\mathbf{P}\) as \(n\) grows, we can also anticipate that random and independent draws with replacement from \(\mathbf{P}_n\) (bootstrap samples) will behave approximately like samples drawn from \(\mathbf{P}\). This has an important consequence: statistics computed on bootstrap samples can be used to approximate the sampling distribution, and hence the variance, of statistics computed on the original data. Here is an example that uses bootstrap samples to approximate the distribution of \(\bar{x}\) and construct a confidence interval:
# plot the estimated cdf, then overlay the cdfs of bootstrap samples
plot(u, Fn, type = "l", xlab = "x", ylab = "F(x)", lwd = 2, col = "red")
# bootstrapped means
nbs = 100
xbarb = rep(0, nbs)
for (i in 1:nbs)
{
  xb = sample(x, replace = TRUE)
  points(u, mycdf(u, xb), type = "s", col = "gray")
  xbarb[i] = mean(xb)
}
# bootstrap estimation of 90% confidence interval of xbar
quantile(xbarb, c(0.05, 0.95))
## 5% 95%
## -0.1634821 0.1565188
This idea of bootstrapping can be applied to random forests. By fitting a random forest to each bootstrap sample, we can approximate the distribution of the random forest estimator \(\hat{f}_\text{RF}(x)\) itself. A naive procedure may work as follows: draw many bootstrap samples from the original data, fit a random forest to each of them, record the quantity of interest (e.g., a prediction or a variable importance measure) from each fit, and use the variability of these values across the bootstrap replicates as a variance estimate.
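To make this concrete, here is a minimal sketch of the naive procedure using the randomForestSRC package, with the prediction at a single target point as the quantity of interest. The simulated data, the target point x0, and the number of replicates nbs are arbitrary choices for illustration.

# a minimal sketch of the naive bootstrap procedure for a random forest prediction
library(randomForestSRC)
set.seed(1)
n = 200
p = 5
x = matrix(rnorm(n*p), n, p)
y = x[, 1] + rnorm(n)
dat = data.frame(x, y)
# a target point at which we want the prediction variance
x0 = data.frame(matrix(0, 1, p))
colnames(x0) = colnames(dat)[1:p]
# refit the forest on each bootstrap sample and record the prediction at x0
nbs = 50
pred.b = rep(0, nbs)
for (b in 1:nbs)
{
  id = sample(1:n, n, replace = TRUE)
  fit.b = rfsrc(y ~ ., dat[id, ])
  pred.b[b] = predict(fit.b, x0)$predicted
}
# bootstrap estimate of the variance of the random forest prediction at x0
var(pred.b)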
However, this approach may encounter significant issues for some quantities of interest, such as variable importance. The main issue is that bootstrap samples contain replicates of the same observation. Hence, in a bootstrap sample, the same observation may appear in both the training examples (in-bag data) and the out-of-bag (oobag) data of a given tree. This introduces bias into the variable importance calculation because the in-bag and oobag data are not independent.
To address this issue, Ishwaran and Lu (2019) proposed two approaches. One is called the .164 bootstrap estimator. The main idea is straightforward: when using a bootstrap sample to fit a random forest, remove an observation from a tree's oobag data if it is already used in that tree's in-bag fitting. Based on some simple mathematical approximations, this leaves approximately \(0.164 n\) unique observations in the oobag set of each tree, and the resulting estimator is called a double-bootstrap estimator (Ishwaran and Lu 2019) for variable importance. This approach is applicable to any type of random forest. An example using the randomForestSRC package is presented below. In this code, we have 30 variables, while only the first and third variables are useful.
set.seed(1)
n = 300
p = 30
x = matrix(rnorm(n*p), n, p)
y = x %*% c(1, 0, 1, rep(0, p-3)) + rnorm(n)
library(randomForestSRC)
# training the forest
reg.o <- rfsrc(y ~ ., data.frame(x, y))
# the .164 double bootstrap approach
reg.dbs.o <- subsample(reg.o, B = 100, bootstrap = TRUE, verbose = FALSE)
plot.subsample(reg.dbs.o)
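As a quick side check, the 0.164 figure can be verified with a small simulation of the double-bootstrap mechanism: bootstrap the data, let a tree bootstrap that bootstrap sample, and count how many of the original observations end up truly out-of-bag for the tree. The sample size below is an arbitrary choice.

# simulate the double bootstrap: fraction of observations that are truly
# out-of-bag for a tree after removing in-bag replicates
set.seed(1)
n = 10000
bs1 = sample(1:n, n, replace = TRUE)   # outer bootstrap sample
bs2 = sample(bs1, n, replace = TRUE)   # a tree's bootstrap of the bootstrap sample
oob = setdiff(bs1, bs2)                # ids in the bootstrap sample but not used by the tree
length(oob) / n                        # approximately 0.164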
What's wrong with taking the bootstrapped trees from a single fitted random forest and using the distribution of their variable importance measures?
The jackknife is another approach for estimating the variance of an estimator. The main idea is to remove one observation at a time and measure how much the estimator changes. The jackknife is closely related to the influence function, which quantifies the effect of each observation on an estimator and can be used to characterize the variance of any estimator; the jackknife serves as an empirical approximation to it. The variance of the estimator can then be quantified as the sum of these squared changes. A general formula is given by the following:
\[ \hat{\text{V}}_\text{J}(\hat \theta_\ast) = \frac{n-1}{n} \sum_{i = 1}^{n} \left( \hat \theta_{(-i)} - \hat \theta_\ast \right)^2, \]
where \(\hat \theta_\ast\) is the average of all such remove-one estimators, i.e.,
\[ \hat\theta_\ast = \frac{1}{n} \sum_i \hat \theta_{(-i)}, \]
and \(\hat \theta_{(-i)}\) is the estimator calculated with observation \(i\) removed. For the sample mean, you can check that \(\hat\theta_\ast\) is simply \(\bar x\). The following code performs this calculation.
# obtain some samples from normal
n = 100
x = rnorm(n)
# calculating the influence function
influence = rep(0, length(x))
xbar = mean(x) # theta_*
for (i in 1:length(x))
{
  xbar_no_i = mean(x[-i])
  influence[i] = xbar_no_i - xbar
}
# jackknife variance estimation
var_est = sum(influence^2) * (length(x) - 1) / (length(x))
# compare with the theoretical estimation
var_est
## [1] 0.008219851
var(x) / length(x)
## [1] 0.008219851
Surprisingly (or maybe not?), the jackknife variance estimator is identical to the usual estimator of the variance of \(\hat \theta = \bar x\). This happens because \(\bar x\) is a linear estimator of the samples, so the variance of the estimator is exactly the sum of the contributions from the individual observations. We will use this logic for estimating confidence intervals of predictions later on. But for now, let's go back to the application of variable importance, with a slight extension.
Since the oobag data contain only a small fraction of the total sample, the bootstrap version of the variable importance evaluation may be inefficient and unstable, and you may need a large number of bootstrap samples to obtain a stable variance estimate. One alternative is to use subsamples of size \(n - d\), instead of bootstrap samples, to fit each replicate of the random forest, and to approximate the variance using their deviations from the full-sample random forest. In this case, the size \(n - d\) subsamples do not contain replicates, so we can still use a regular random forest to fit each subsample, and you will get approximately \((1 - 0.632)(n - d)\) oobag observations in each tree. The delete-d jackknife estimator (Shao and Wu 1989) is a variant of the traditional jackknife estimator. If we run over all possible size \(n - d\) subsamples, the formula is given by
\[ \hat{\text{V}}_{\text{J(d)}}(\hat \theta_n) = \frac{n-d}{d} {\binom{n}{n-d}}^{-1} \sum_{|S| = n - d} \left( \hat \theta_{S} - \bar \theta_{n, d} \right)^2, \]
where \(\hat \theta_{S}\) is the estimator fitted on the size \(n - d\) subsample \(S\) (equivalently, with the \(d\) observations outside \(S\) removed), and \(\bar \theta_{n, d}\) is the average of all such delete-\(d\) estimators, similar to the jackknife. In practice, we can simply treat \(\bar \theta_{n, d}\) as the random forest fitted to the entire sample and ignore the difference. The fraction \(\frac{n-d}{d}\) is a scaling factor that corrects for the reduced sample size, and we only need to run a finite number of random subsets, rather than all possible size \(n - d\) subsamples, to obtain an approximation. Below, a small sketch first illustrates this formula on the sample mean, followed by the delete-d jackknife for variable importance using the previously generated data.
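Here is a minimal sketch of the delete-d jackknife applied to the sample mean, where the average over all subsets is approximated by nsub randomly drawn subsets; the choices of d and nsub are arbitrary. For the sample mean, the result should be close to the usual estimate var(x)/n.

# delete-d jackknife for the sample mean, approximated with random subsets
set.seed(1)
n = 100
x = rnorm(n)
d = 20
nsub = 2000
theta_sub = rep(0, nsub)
for (i in 1:nsub)
  theta_sub[i] = mean(x[sample(1:n, n - d)])   # estimator fitted on a size n-d subsample
# delete-d jackknife variance estimate with the (n-d)/d scaling factor
(n - d) / d * mean((theta_sub - mean(theta_sub))^2)
# compare with the usual estimate of the variance of the sample mean
var(x) / n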
# Delete-d jackknife estimator
reg.smp.o <- subsample(reg.o, verbose = FALSE)
plot.subsample(reg.smp.o)
Use the Boston housing data to fit a random forest. Then estimate the variance of the variable importance measure, select a smaller set of significant variables and refit the model with the selected variables.
data("Boston", package = "MASS")
# training the forest
reg.boston <- rfsrc(medv ~ ., Boston, ntree = 1000, mtry = 13, nodesize = 5, nsplit = 5)
tail(reg.boston$err.rate, 1)
## [1] 12.39758
# Delete-d jackknife estimator
reg.smp.boston <- subsample(reg.boston, B = 200, bootstrap = FALSE, verbose = FALSE)
plot.subsample(reg.smp.boston)
# refit the model
reg.boston <- rfsrc(medv ~ lstat + rm + dis + crim + nox + rad + tax, data = Boston,
ntree = 1000, mtry = 6, nodesize = 6, nsplit = 5)
tail(reg.boston$err.rate, 1)
## [1] 12.50151
While the previous approaches can estimate the variance of quantities derived from a random forest, they still require fitting many random forests, which can be computationally expensive. However, if we look more closely inside a single random forest, it already creates many variations of subsamples or bootstrap samples through its trees. Although we cannot simply resample these trees to calculate the variance, we might be able to exploit this mechanism in other ways. In this section, we focus on variance estimation for a different task: predicting at a new target point \(x_0\). We start with the jackknife procedure proposed in Wager, Hastie, and Efron (2014).
The idea of the jackknife can be incorporated into quantifying the variance of a prediction, rather than variable importance, although this comes with a small twist. Let's first look at the jackknife again, viewing the random forest prediction at a new target point \(x_0\) as the parameter of interest:
\[ \hat{\text{V}}_\text{J}\big(\hat f_\text{RF}(x_0) \big) = \frac{n-1}{n} \sum_{i = 1}^{n} \left( \hat f_{(-i)}(x_0) - \hat f_\text{RF}(x_0) \right)^2. \]
Here, \(\hat f_\text{RF}(x_0)\) is the random forest prediction built using all observations; it is the average of \(B\) trees, each fitted on a different bootstrap sample of the original \(n\) observations:
\[ \hat f_\text{RF}(x_0) = \frac{1}{B} \sum_b \hat f_b(x_0). \]
The quantity \(\hat f_{(-i)}(x_0)\) is also a random forest prediction, but built without observation \(X_i\). The question then is: can we estimate this variance with just one fitted random forest? Consider the following strategy: for each observation \(i\), collect the trees whose bootstrap samples do not contain \(X_i\), and average their predictions to form \(\hat f_{(-i)}(x_0)\); then plug these leave-one-out averages into the jackknife formula above.
This is the so-called Jackknife-after-Bootstrap estimator proposed by Efron (1992). This procedure is implemented in the ranger package.
set.seed(1)
ntrain = 600
ntest = 100
n = ntrain + ntest
p = 30
x = matrix(rnorm(n*p), n, p)
b = c(0.5, 0, 0.5, rep(0, p-3))
y = x %*% b + rnorm(n)
traindata = data.frame(x, y)[1:ntrain, ]
testdata = data.frame(x, y)[-(1:ntrain), ]
# calculate confidence interval
library(ranger)
ranger.fit <- ranger(y~., data = traindata, num.trees = 5000,
keep.inbag=TRUE, mtry = 30, min.node.size = 10)
# Infinitesimal Jackknife (infjack) or jackknife-after-bootstrap (jack)
ranger.pred <- predict(ranger.fit, data = testdata,
type = "se", se.method = "jack")
# the true link function
testlink = data.matrix(testdata[, 1:p]) %*% b
# coverage
lowerbound = ranger.pred$predictions - 1.96*ranger.pred$se
upperbound = ranger.pred$predictions + 1.96*ranger.pred$se
cover = (testlink <= upperbound) & (testlink >= lowerbound)
# plot results
plot(testlink, ranger.pred$predictions,
xlim = c(-2, 2), ylim = c(-2, 2),
pch = 19, col = ifelse(cover, "green", "red"),
xlab = "True Mean", ylab = "Estimated Mean and 95% CI")
lines(x = c(-2, 2), y = c(-2, 2))
# the confidence interval
for (i in 1:nrow(testdata))
lines(x = c(testlink[i], testlink[i]),
y = c(lowerbound[i], upperbound[i]))
## average standard error
mean(ranger.pred$se)
## [1] 0.311566
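To get a sense of what se.method = "jack" is computing, here is a rough manual sketch of the jackknife-after-bootstrap estimate at the first test point, using the in-bag counts stored in the fitted forest. This simplified version omits any finite-\(B\) bias correction, so it will not exactly match the packaged value.

# a rough manual sketch of the jackknife-after-bootstrap standard error
# at the first test point (no finite-B bias correction)
inbag = simplify2array(ranger.fit$inbag.counts)   # n x B matrix of in-bag counts
treepred = as.vector(predict(ranger.fit, data = testdata[1, ],
                             predict.all = TRUE)$predictions)
fhat = mean(treepred)                             # the forest prediction at this point
fminus = rep(0, ntrain)
for (i in 1:ntrain)
  fminus[i] = mean(treepred[inbag[i, ] == 0])     # trees whose bootstrap sample excludes obs i
# jackknife-after-bootstrap variance and standard error
VJ = (ntrain - 1) / ntrain * sum((fminus - fhat)^2)
sqrt(VJ)
# compare with the packaged estimate
ranger.pred$se[1]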
A slight variation of this approach is the Infinitesimal Jackknife estimator (Efron 2014), which utilizes the covariance between the number of times an observation appears in each tree and that tree's prediction. The motivation is that, when deleting just one observation, the remaining trees still carry a great deal of Monte Carlo uncertainty. To estimate the influence of adding or removing one observation, we can instead look at the covariance between how many times observation \(X_i\) appears in a tree and the tree's prediction. The formula is given by
\[ \hat{\text{V}}_\text{IJ}\big(\hat f_\text{RF}(x_0)\big) = \sum_{i = 1}^{n} \widehat{ \text{Cov} }_i^2 = \sum_{i = 1}^{n} \left[ \frac{\sum_b (N_{bi} - 1) \left( \hat f_b(x_0) - \hat f_\text{RF}(x_0) \right) }{B} \right]^2, \]
where \(N_{bi}\) is the number of replicates of observation \(i\) in the \(b\)-th tree, and \(\hat f_b(x_0)\) is the \(b\)-th tree's prediction at \(x_0\). The random forest version of this estimator was also proposed in Wager, Hastie, and Efron (2014). It should be noted that both of the above estimators can be badly biased upwards, although the bias is reduced with a large number of trees \(B\) (a large \(B\) is always recommended in practice). Bias corrections were also introduced in that paper (see also Sexton and Laake (2009)). Still, in practice, jackknife-related estimators often overestimate the variance; one reason is that they rely on the estimator being approximately linear and smooth. The following code performs predictions and computes confidence intervals using the Infinitesimal Jackknife.
# Infinitesimal Jackknife (infjack) or jackknife-after-bootstrap (jack)
ranger.pred <- predict(ranger.fit, data = testdata,
type = "se", se.method = "infjack")
# the true link function
testlink = data.matrix(testdata[, 1:p]) %*% b
# coverage
lowerbound = ranger.pred$predictions - 1.96*ranger.pred$se
upperbound = ranger.pred$predictions + 1.96*ranger.pred$se
cover = (testlink <= upperbound) & (testlink >= lowerbound)
# plot results
plot(testlink, ranger.pred$predictions,
xlim = c(-2, 2), ylim = c(-2, 2),
pch = 19, col = ifelse(cover, "green", "red"),
xlab = "True Mean", ylab = "Estimated Mean and 95% CI")
lines(x = c(-2, 2), y = c(-2, 2))
# the confidence interval
for (i in 1:nrow(testdata))
lines(x = c(testlink[i], testlink[i]),
y = c(lowerbound[i], upperbound[i]))
## average standard error
mean(ranger.pred$se)
## [1] 0.2509218
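The IJ formula above can also be evaluated directly from the in-bag counts and the per-tree predictions. Here is a rough manual sketch at the first test point, again without the bias correction and calibration steps used by the package, so the value will not match exactly.

# a rough manual sketch of the infinitesimal jackknife standard error
# at the first test point, following the formula above
inbag = simplify2array(ranger.fit$inbag.counts)   # n x B matrix of in-bag counts
treepred = as.vector(predict(ranger.fit, data = testdata[1, ],
                             predict.all = TRUE)$predictions)
fhat = mean(treepred)
# estimated covariance between (N_bi - 1) and the tree predictions, for each i
covi = apply(inbag - 1, 1, function(Ni) mean(Ni * (treepred - fhat)))
sqrt(sum(covi^2))
# compare with the packaged estimate
ranger.pred$se[1]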
Use the Boston dataset to predict medv using the four variables lstat, rm, dis, and crim. Predict a new listing with lstat = 2, rm = 8, dis = 6, and crim = 0.01. It should have a relatively high prediction.
# training the forest
ranger.boston <- ranger(medv ~ lstat + rm + dis + crim, data = Boston,
num.trees = 10000, keep.inbag=TRUE)
testboston = data.frame("lstat" = 2,
"rm" = 8,
"dis" = 6,
"crim" = 0.01)
ranger.pred <- predict(ranger.boston, data = testboston,
type = "se", se.method = "infjack")
## Warning in rInfJack(pred = result$predictions, inbag = inbag.counts, used.trees = 1:num.trees): Sample size <=20, no calibration performed.
c(ranger.pred$predictions - 1.96 * ranger.pred$se,
ranger.pred$predictions + 1.96 * ranger.pred$se)
## [1] 45.21598 49.62977
U-statistics (Hoeffding 1948) provide a framework for the unbiased estimation of population parameters based on independent and identically distributed (i.i.d.) random variables. The general form of a U-statistic for a sample \((X_1, X_2, \ldots, X_n)\) is:
\[ U_n = {{n \choose k}}^{-1} \sum_{1 \leq i_1 < i_2 < \ldots < i_k \leq n} h(X_{i_1}, X_{i_2}, \ldots, X_{i_k}), \]
where we call \(h(\cdot)\) a kernel function, which is an unbiased estimator of some parameter \(\theta\), i.e.,
\[ E[ h(X_{i_1}, X_{i_2}, \ldots, X_{i_k}) ] = \theta. \]
In our case, we can view \(h(\cdot)\) as a tree estimator using some subsamples, and the random forest estimator is the mean of many such estimators. To simplify the notation, let \(S_i\) represent the \(i\)-th set of samples with size \(k\) from the \(n\) observations. Here, \(i\) ranges from 1 to \(\binom{n}{k}\), the total number of such sets. The U-statistic \(U_n\) can then also be written as:
\[ U_n = {\binom{n}{k}}^{-1} \sum_{i} h(S_i) \]
To give a simple example, we start with the U-statistic representation of sample variance, denoted as \(U_n\):
\[\begin{equation} U_n = {{n \choose 2}}^{-1} \sum_{1 \leq i < j \leq n} \frac{(X_i - X_j)^2}{2} \end{equation}\]
To relate this directly to the sample variance, we extend the summation to all pairs \(i\) and \(j\) (which double-counts each pair and only adds zero diagonal terms), and then add and subtract the sample mean \(\bar{X}\) inside the squared term:
\[\begin{align*} U_n &= {{n \choose 2}}^{-1} \frac{1}{4} \sum_{i, j} ( X_i - X_j )^2 \\ &= \frac{1}{2n(n-1)} \sum_{i, j} ( X_i - \bar{X} + \bar{X} - X_j )^2 \\ &= \frac{1}{2n(n-1)} \sum_{i, j} \left\{ (X_i - \bar{X})^2 + (X_j - \bar{X})^2 - 2(X_i - \bar{X})(X_j - \bar{X}) \right\} \\ &= \frac{n}{n(n-1)} \sum_{i=1}^n (X_i - \bar{X})^2 + 0\\ &= \frac{1}{n-1} \sum_{i=1}^n (X_i - \bar{X})^2, \end{align*}\]
which is exactly the unbiased variance estimator.
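As a quick numerical check, the following sketch computes \(U_n\) by averaging \((X_i - X_j)^2/2\) over all pairs \(i < j\) and compares it with the built-in var():

# numerical check: the pairwise U-statistic equals the unbiased sample variance
set.seed(1)
x = rnorm(50)
Un = mean(as.vector(dist(x))^2 / 2)   # (X_i - X_j)^2 / 2 averaged over all pairs i < j
c(Un, var(x))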
A random forest can naturally be viewed as a U-statistic by treating the kernel function \(h(x, S_i)\) as a tree prediction at a target point \(x\), built on \(S_i\), a size-\(k\) subsample of the \(n\) observations. Hence, if we exhaust all possible subsamples, the random forest can be written as
\[ \hat f_\text{RF}(x) = U_n = {\binom{n}{k}}^{-1} \sum_{|S_i| = k} h(x, S_i) \]
Then, the variance of this formulation can be quantified via Hoeffding's decomposition (Hoeffding 1948):
\[ \text{Var}\left(U_{n}\right) = \binom{n}{k}^{-1} \sum_{d=1}^{k} \binom{k}{d} \binom{n-k}{k-d} \xi^2_{d, k}, \]
where \(\xi^2_{d, k}\) is the covariance between two kernels \(h(S_1)\) and \(h(S_2)\) with \(S_1\) and \(S_2\) sharing \(d\) overlapping observations, i.e., \(\xi_{d, k}^2 = \text{Cov} \left( h(S_1), h(S_2) \right)\), with \(|S_1 \cap S_2| = d\). Here both \(S_1\) and \(S_2\) are size-\(k\) subsamples. Alternatively, we can represent \(\xi_{d, k}^2\) as (Lee 2019)
\[ \xi_{d, k}^2 = \text{Var} \left[ E \left (h(S) | X_1, ..., X_d \right) \right]. \]
A key phenomenon we can exploit is that when \(k\) is small relative to \(n\), the first term of Hoeffding's decomposition dominates the entire sum, because the combinatorial weight \(\binom{k}{d} \binom{n-k}{k-d}\) is much larger for \(d = 1\) than for any other \(d\). This means that we only need to estimate \(\xi_{1, k}^2\), the covariance between two tree estimators that share exactly one sample. This approach was illustrated in Mentch and Hooker (2016), who proposed the U-statistic view of random forests. Several difficulties are involved in this formulation: mainly, when \(k\) is large, the first term no longer dominates, and one then needs to estimate many such terms, which is computationally infeasible. Although some corrections have been proposed (Zhou, Mentch, and Hooker 2021), they do not unbiasedly estimate the variance for large \(k\). However, Xu, Zhu, and Shao (2022) fixed the issue by using the conditional variance formula:
\[\begin{align*} \xi_{d, k}^2 =& \text{Var} \left[ E \left (h(S) | X_1, ..., X_d \right) \right] \\ =& \text{Var}(h(S)) - E\left[\text{Var}(h(S) | X_1, ..., X_d) \right], \end{align*}\]
where the first term can be unbiasedly estimated using trees that do not share any samples, and the second term can be estimated using trees that share \(d\) samples. Applying this trick to all terms in Hoeffding's decomposition leads to a simple formula for an unbiased estimator of the variance:
\[ \text{Var}\left(U_{n}\right) = \text{V}^{(h)} - \text{V}^{(s)}, \]
in which the first term is the variance of trees built on independent samples, and the second term is the sample variance of trees whose training sets are subsamples of the same \(n\) observations. Here is an example using the RLT package. One limitation of this approach is that it only allows \(k\) to be as large as half of \(n\).
# generate the same dataset as previous example
set.seed(1)
ntrain = 600
ntest = 100
n = ntrain + ntest
p = 30
x = matrix(rnorm(n*p), n, p)
b = c(0.5, 0, 0.5, rep(0, p-3))
y = x %*% b + rnorm(n)
traindata = data.frame(x, y)[1:ntrain, ]
testdata = data.frame(x, y)[-(1:ntrain), ]
library(RLT)
## RLT and Random Forests v4.2.6
## pre-release at github.com/teazrq/RLT
RLT.fit <- RLT(traindata[, 1:p], traindata$y, model = "regression",
ntrees = 20000, mtry = 30, nmin = 5,
resample.replace = FALSE, resample.prob = 0.5,
param.control = list("var.ready" = TRUE,
"resample.track" = TRUE),
verbose = TRUE)
## Regression Random Forest ...
## ---------- Parameters Summary ----------
## (N, P) = (600, 30)
## # of trees = 20000
## (mtry, nmin) = (30, 5)
## split generate = Random, 1
## sampling = 0.5 w/o replace
## (Obs, Var) weights = (No, No)
## importance = none
## reinforcement = No
## ----------------------------------------
RLT.Pred <- predict(RLT.fit, testdata, var.est = TRUE)
# coverage
lowerbound = RLT.Pred$Prediction - 1.96*sqrt(RLT.Pred$Variance)
upperbound = RLT.Pred$Prediction + 1.96*sqrt(RLT.Pred$Variance)
cover = (testlink <= upperbound) & (testlink >= lowerbound)
# plot results
plot(testlink, RLT.Pred$Prediction,
xlim = c(-2, 2), ylim = c(-2, 2),
pch = 19, col = ifelse(cover, "green", "red"),
xlab = "True Mean", ylab = "Estimated Mean and 95% CI")
lines(x = c(-2, 2), y = c(-2, 2))
# the confidence interval
for (i in 1:nrow(testdata))
lines(x = c(testlink[i], testlink[i]),
y = c(lowerbound[i], upperbound[i]))
## average standard error, typically smaller than the Jackknife-based estimators
## you may occasionally get negative variance estimates...
mean(sqrt(RLT.Pred$Variance), na.rm = TRUE)
## [1] 0.1665776
Use the Boston dataset to predict medv using the four variables lstat, rm, dis, and crim (columns 1, 6, 8, and 13). Predict a new listing with lstat = 2, rm = 8, dis = 6, and crim = 0.01. It should have a relatively high prediction.
data("Boston", package = "MASS")
# training the forest
library(RLT)
RLT.boston <- RLT(x = Boston[, c(1, 6, 8, 13)], y = Boston$medv,
ntrees = 10000, mtry = 4, nmin = 5,
resample.replace = FALSE, resample.prob = 0.5,
param.control = list("var.ready" = TRUE,
"resample.track" = TRUE),
verbose = TRUE)
## Regression Random Forest ...
## ---------- Parameters Summary ----------
## (N, P) = (506, 4)
## # of trees = 10000
## (mtry, nmin) = (4, 5)
## split generate = Random, 1
## sampling = 0.5 w/o replace
## (Obs, Var) weights = (No, No)
## importance = none
## reinforcement = No
## ----------------------------------------
testboston = data.frame("lstat" = 2,
"rm" = 8,
"dis" = 6,
"crim" = 0.01)
RLT.Pred <- predict(RLT.boston, testboston, var.est = TRUE)
# confidence interval
lowerbound = RLT.Pred$Prediction - 1.96*sqrt(RLT.Pred$Variance)
upperbound = RLT.Pred$Prediction + 1.96*sqrt(RLT.Pred$Variance)
c(lowerbound, upperbound)
## [1] 43.22363 46.68151