Random forests are widely viewed as a black-box prediction tool. They average many trees, each producing a piecewise constant fit. At a deeper level, however, every random forest predictor can be viewed as an (adaptive) kernel method. This kernel view was originally explored in Breiman (2000), and further connections are developed in Lin and Jeon (2006), Scornet (2016), Athey, Tibshirani, and Wager (2019), and many other papers. Here we provide a brief overview of this perspective.
A tree-induced kernel is defined as follows. Consider a single tree \(\cT\) built on the training data \(\{(X_i, Y_i)\}_{i=1}^n\), with terminal nodes \(A_1, A_2, \ldots, A_T\). If a target point falls into the terminal node \(A_l\), then the prediction is the average of the outcomes of the training points in that node, i.e.,
\[ \frac{1}{n_{A_l}} \sum_{i: X_i \in A_l} Y_i = \sum_{i=1}^n \frac{1(X_i \in A_l)}{n_{A_l}} Y_i \]
where \(n_{A_l} = \sum_{i=1}^n 1(X_i \in A_l)\) is the number of training points in node \(A_l\). Hence, for any target point \(x\), the tree prediction is obtained by locating the terminal node that \(x\) falls into and taking that node's average. This can be expressed as a weighted average of all training outcomes:
\[ \begin{aligned} \hat{f}_\cT(x) &= \sum_{i=1}^n \color{darkorange}{\sum_{l=1}^T \frac{1(X_i \in A_l)}{n_{A_l}} 1(x \in A_l)} Y_i \\ &= \sum_{i=1}^n \color{darkorange}{w_i(x)} Y_i \end{aligned} \]
And the forest prediction is
\[ \hat{f}_{RF}(x) = \frac{1}{B} \sum_{b=1}^B \hat{f}_{\cT_b}(x) = \sum_{i=1}^n \left( \frac{1}{B} \sum_{b=1}^B w_i^{(b)}(x) \right) Y_i. \]
The quantity \(\frac{1}{B} \sum_{b=1}^B w_i^{(b)}(x)\) can be regarded as the aggregated forest weight attached to observation \(i\).
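To make this concrete, here is a small sketch using scikit-learn's `RandomForestRegressor`. The data and settings are made up for illustration; we set `bootstrap=False` so that every tree's leaf average is taken over all training points, which is exactly the weight formula above.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Toy regression data (made up for illustration)
rng = np.random.default_rng(0)
n, p = 200, 5
X = rng.uniform(size=(n, p))
y = np.sin(2 * np.pi * X[:, 0]) + 0.1 * rng.normal(size=n)

# bootstrap=False so each tree's leaf value is the average of *all* training
# points in that leaf, matching the weight formula above exactly
rf = RandomForestRegressor(n_estimators=200, max_features=0.5,
                           bootstrap=False, random_state=0).fit(X, y)

x0 = rng.uniform(size=(1, p))       # a target point
train_leaves = rf.apply(X)          # (n, B): leaf index of each training point in each tree
x0_leaves = rf.apply(x0)            # (1, B): leaf index of the target point in each tree

# Per-tree weights 1{same leaf} / n_leaf, then average over trees
same_leaf = (train_leaves == x0_leaves)            # (n, B)
w = (same_leaf / same_leaf.sum(axis=0)).mean(axis=1)

print(np.isclose(w.sum(), 1.0))                    # weights sum to one
print(np.isclose(w @ y, rf.predict(x0)[0]))        # weighted average = forest prediction
```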
In fact, we can also define the corresponding kernel function of a tree. For any two points \(x, z \in \RR^p\),
\[ K_\cT(x, z) = \sum_{l = 1}^T \frac{1}{n_{A_l}} 1(x \in A_l) 1(z \in A_l) \]
This kernel function is a positive definite kernel that is equivalent to the feature map
\[ \Phi_\cT(x) = \bigg( \frac{1(x \in A_1)}{\sqrt{n_{A_1}}}, \ldots, \frac{1(x \in A_T)}{\sqrt{n_{A_T}}} \bigg)^\top \]
And for random forests, the aggregated kernel is simply the average of individual tree kernels
\[ K_{RF}(x, z) = \frac{1}{B} \sum_{b=1}^B K_{\cT_b}(x, z), \]
which is a valid positive definite kernel as well. However, it is worth pointing out that a random forest is not an RKHS method, since it does not minimize a regularized loss over an RKHS. Instead, the kernel is learned adaptively from the data, and the weights are not obtained from a global optimization problem.
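The aggregated kernel can likewise be read off from leaf memberships. The sketch below reuses `rf`, `X`, `y`, `x0`, `rng`, and `p` from the previous snippet (so it again assumes scikit-learn and `bootstrap=False`, which guarantees every leaf contains training points); it builds the Gram matrix on a set of new points, checks positive semi-definiteness, and verifies that \(K_{RF}(x, X_i)\) recovers the forest weights \(w_i(x)\).

```python
import numpy as np

def forest_kernel(rf, Xa, Xb, X_train):
    """K_RF(a, b) = (1/B) sum_b sum_l 1{a in A_l} 1{b in A_l} / n_{A_l},
    computed from leaf memberships of a fitted forest."""
    train_leaves = rf.apply(X_train)                 # (n, B) leaf ids of training points
    leaves_a, leaves_b = rf.apply(Xa), rf.apply(Xb)  # leaf ids of the two point sets
    K = np.zeros((Xa.shape[0], Xb.shape[0]))
    for t in range(rf.n_estimators):
        n_leaf = np.bincount(train_leaves[:, t])              # leaf sizes n_{A_l}
        same = leaves_a[:, [t]] == leaves_b[:, t][None, :]    # same-leaf indicator
        K += same / n_leaf[leaves_a[:, [t]]]
    return K / rf.n_estimators

Z = rng.uniform(size=(50, p))
K = forest_kernel(rf, Z, Z, X)
print(np.linalg.eigvalsh((K + K.T) / 2).min() >= -1e-8)              # PSD Gram matrix
print(np.allclose(forest_kernel(rf, x0, X, X) @ y, rf.predict(x0)))  # K(x0, X_i) are the weights w_i(x0)
```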
We should also mention that this kernel is not a proper density function, since it does not integrate to 1 for any fixed \(x\). To make it one, we can either replace the training count \(n_{A_l}\) with the actual measure \(|A_l|\) of each leaf, or simply use an equal weight across all leaves if we believe the terminal nodes are roughly of the same size (controlled by the parameter nodesize). The first strategy makes sense when we consider the population (infinite-sample) version of the kernel: as the training size \(n \to \infty\), and if \(X\) is distributed uniformly on \([0,1]^p\), the probability of falling into a node \(A_l\) is exactly the measure \(|A_l|\) of that node. Even so, it is difficult to analyze the exact form of this kernel, since the shape and size of each node is random and data-adaptive. In Breiman (2000), the author considered a simplified version of trees and analyzed the corresponding kernel function. Here are the (modified) conditions:

- The splits do not use the responses \(Y_i\) (the tree is purely random), so the partition is independent of the training data.
- Starting from the root cell \([0,1]^p\), the tree is grown until it has \(T\) terminal nodes; at each step, one of the current terminal nodes is selected uniformly at random to be split.
- Within a cell's lineage, the splitting coordinate rotates cyclically through \(1, \ldots, p\) (Breiman's original version picks the coordinate at random; see the footnote).
- The split point is drawn uniformly over the selected cell's current range along that coordinate.
It turns out that, under this procedure, the infinite-forest version of the random forest kernel is almost a Laplace kernel. More specifically,
\[ K_\infty(x, z) \approx \exp\!\Big(- \frac{\log T}{p} \cdot \|x - z\|_1\Big). \]
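Before going through the argument, this claim can be checked with a quick Monte Carlo sketch (numpy assumed; the function name and the parameter values below are made up). Only the cell containing both \(x\) and \(z\) needs to be tracked, since leaves are selected uniformly at random and the geometry of the other cells never affects whether the pair stays together. The estimate should be of the same order as the Laplace form, which the proof below shows is (roughly) an upper bound.

```python
import numpy as np

def same_leaf_purely_random(x, z, T, rng):
    """Grow one simplified tree to T leaves; return 1 if x and z end in the same leaf.
    Only the cell containing both points is tracked (leaves are selected uniformly)."""
    lo, hi = np.zeros_like(x), np.ones_like(x)
    cuts = 0                                   # number of cuts this cell's lineage has received
    for nleaves in range(1, T):                # T - 1 splits grow the tree from 1 to T leaves
        if rng.random() < 1.0 / nleaves:       # our cell is the leaf chosen for splitting
            j = cuts % len(x)                  # cyclic coordinate choice
            cut = rng.uniform(lo[j], hi[j])    # uniform cut within the cell's current range
            if min(x[j], z[j]) < cut <= max(x[j], z[j]):
                return 0.0                     # the cut separates x and z
            if cut <= min(x[j], z[j]):         # otherwise keep the side holding both points
                lo[j] = cut
            else:
                hi[j] = cut
            cuts += 1
    return 1.0

rng = np.random.default_rng(1)
p, T = 4, 256
x = rng.uniform(size=p)
z = np.clip(x + rng.uniform(-0.15, 0.15, size=p), 0, 1)
est = np.mean([same_leaf_purely_random(x, z, T, rng) for _ in range(20000)])
print(est, np.exp(-np.log(T) / p * np.sum(np.abs(x - z))))
```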
The proof is fairly simple. Define the gap on the \(j\)th coordinate between \(x\) and \(z\) as \(d_j = |x_j-z_j| \in [0,1]\). We want the probability that \(x\) and \(z\) fall into the same terminal node after the tree is grown to \(T\) leaves, so define the indicator
\[ J_{\mathcal T}(x,z) = \mathbf 1\{x \text{ and } z \,\text{ end in the same terminal node of }\mathcal T\}. \]
Hence, the population (infinite-forest) kernel is the expectation of this indicator, since the splits are independent of the data points and of each other:
\[ K_\infty(x,z) \;=\; \mathbb{E}[J_{\mathcal T}(x,z)]. \]
Let \(M\) denote the number of times the node containing both \(x\) and \(z\) is selected for splitting during growth. When the tree has \(\nu\) terminal nodes, each of them is selected with probability \(1/\nu\), so
\[ M\ \stackrel d=\ \sum_{\nu=1}^{T}\mathrm{Bernoulli}(1/\nu). \]
These selections are independent across steps, so the expected count is a harmonic number, which for large \(T\) is approximately
\[ \mathbb{E}[M] = \sum_{\nu=1}^{T} \frac{1}{\nu} = H_{T} \approx \log T + 0.5772. \]
Based on our modified algorithm, the cell that contains both \(x\) and \(z\) will be cut \(M/p\) times along each coordinate1. Let us therefore look at the probability that \(x\) and \(z\) are not separated after \(k\) cuts on a single coordinate with gap \(d\), which is the probability that all \(k\) cuts fall outside the interval \((x_j, z_j)\). At each split, the cut point is drawn uniformly over the current node's range along the chosen dimension, so the node's length along that dimension shrinks multiplicatively. The length after \(k\) cuts (given that none of them fell between \(x_j\) and \(z_j\)) is distributed as
\[ \begin{aligned} b_k &\stackrel d=\ \prod_{i=1}^k U_i \quad \text{where } U_i \stackrel{iid}{\sim} \mathrm{Uniform}(0,1). \\ \log b_k &\stackrel d=\ \sum_{i=1}^k \log U_i \\ &\stackrel d=\ - \sum_{i=1}^k E_i \quad \text{ where } E_i \stackrel{iid}{\sim} \mathrm{Exponential}(1). \\ &\stackrel d=\ - \mathrm{Gamma}(k,1) \end{aligned} \]
and the probability that \(b_k\) is larger than the gap \(d\) is
\[ \begin{aligned} \Pr(b_k > d) &= \Pr(-\log b_k < -\log d) \\ &= \Pr\Big(\mathrm{Gamma}(k,1) < \log\frac{1}{d} \Big). \end{aligned} \]
Assuming that \(M\) is a multiple of \(p\), so that each coordinate is cut exactly \(M/p\) times, we have
\[ K_{\infty}(x,z) =\mathbb E\!\left[\ \prod_{j=1}^p \Pr\!\big(\mathrm{Gamma}(M/p,1) < \log(1/d_j)\big) \ \right]. \]
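As a numerical sanity check, the expression above can be evaluated by drawing \(M\) as the sum of Bernoulli indicators and averaging the product of Gamma tail terms. This is a sketch assuming numpy and scipy; the dimension, number of leaves, and gaps are made up, and for comparison we also print the Laplace form with \(M\) fixed at its expectation \(\log T\), as in the approximation that follows.

```python
import numpy as np
from scipy.stats import gamma

rng = np.random.default_rng(2)
p, T = 4, 256
d = rng.uniform(0.02, 0.2, size=p)                 # the gaps d_j = |x_j - z_j|

def draw_M(T, rng):
    # M =d  sum_{nu=1}^{T} Bernoulli(1/nu)
    return sum(rng.random() < 1.0 / nu for nu in range(1, T + 1))

vals = []
for _ in range(5000):
    M = draw_M(T, rng)
    vals.append(np.prod(gamma.cdf(np.log(1.0 / d), a=M / p)))   # prod_j P(Gamma(M/p,1) < log(1/d_j))

print(np.mean(vals))                               # Monte Carlo value of the displayed expectation
print(np.exp(-np.log(T) / p * d.sum()))            # Laplace form with M fixed at log T
```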
At this point, Breiman provided some rough approximations, but we can obtain a simple approximate upper bound. First, fix \(M\) at its expectation \(\log T\), so the node is split \(\log(T)/p\) times along each coordinate. Each cut then has at most a \(1 - d_j\) chance of not separating \(x\) and \(z\).2 Hence the chance that \(x_j\) and \(z_j\) are not separated after \(\log(T)/p\) cuts is at most \((1 - d_j)^{\log(T)/p}\). Furthermore,
\[ \Pr(x_j \text{ and } z_j \text{ not separated after } k \text{ splits}) \le (1 - d_j)^k \le e^{-k d_j} \]
by the fact that \(1-x \le e^{-x}\) for any \(x\). Thus we have
\[ \begin{aligned} K_\infty(x,z) & \le \prod_{j=1}^p \Pr(x_j \text{ and } z_j \text{ not separated after } \log(T)/p \text{ splits}) \\ & \le \prod_{j=1}^p e^{-\frac{\log(T)}{p} d_j} \\ &= e^{-\frac{\log(T)}{p} \sum_{j=1}^p d_j} \\ &= e^{-\frac{\log(T)}{p} \|x-z\|_1}. \end{aligned} \]
Hence this kernel is upper bounded by a Laplace kernel. Although this is a rough upper bound, it still provides some insight into what a random forest kernel does. The kernel weights decay exponentially as the \(\ell_1\) distance between \(x\) and \(z\) increases. The rate of decay is \(\log(T)/ p\): it grows with the number of leaves \(T\) (through \(\log T\); more leaves means smaller nodes, thus faster decay) and is inversely proportional to the dimension \(p\) (higher dimension means slower decay). A sharper and more precise analysis of the random forest kernel is possible, but it requires a carefully designed splitting mechanism, namely the Mondrian process, which we discuss next.
The purely random forest analysis above yielded a Laplace-type upper bound for the infinite-forest kernel. Another variant of random forests, the Mondrian forest (Lakshminarayanan, Roy, and Teh 2014), built on a continuous-time random partition of axis-aligned rectangles, the Mondrian process (Roy and Teh 2008), is understood much better. The Mondrian process grows each tree with the following mechanism:

- Each terminal node, an axis-aligned rectangle \(A\), carries an independent exponential clock with rate equal to its linear size \(r(A)\), the sum of its side lengths.
- The node whose clock rings first is the next one to be split; equivalently, each split selects a terminal node with probability proportional to \(r(A)\).
- Within the selected node, the splitting coordinate is chosen with probability proportional to the corresponding side length, and the cut point is drawn uniformly along that side.
- The process runs in continuous time and stops at a prespecified lifetime \(\lambda\), after which the current terminal nodes are the leaves.
Interestingly, the infinite forest kernel of the Mondrian forest can be computed exactly, and it is the Laplace kernel we discussed earlier. The intuition of the proof is to attach to the rectangle spanned by \(x\) and \(z\) a clock with rate \(s = \|x-z\|_1\) that competes with the rest of the tree: the rate at which this clock rings always stays at \(s\), independently of how the rest of the space is split. Thus the chance that this clock never rings before time \(\lambda\) is exactly the tail of an exponential distribution with rate \(s\), evaluated at \(\lambda\). To be more specific, let
\[ C_{xz}=\prod_{j=1}^p[\min\{x_j,z_j\},\max\{x_j,z_j\}], \]
be the rectangle spanned by \(x\) and \(z\). Let \(d_j = |x_j-z_j|\) be the gap on dimension \(j\), and \(s = \|x-z\|_1 = \sum_{j=1}^p d_j\) the linear size of the block \(C_{xz}\). First, we recall some properties of exponential distributions:

- (Memorylessness) If \(X \sim \mathrm{Exp}(\lambda)\), then \(\Pr(X > t + u \mid X > t) = \Pr(X > u)\) for all \(t, u \ge 0\).
- (Minimum) If \(X_1, \ldots, X_K\) are independent with \(X_k \sim \mathrm{Exp}(\lambda_k)\), then \(\min_k X_k \sim \mathrm{Exp}\big(\sum_k \lambda_k\big)\).
- (Competition) The probability that \(X_k\) attains the minimum is \(\lambda_k / \sum_{k'} \lambda_{k'}\), and this event is independent of the value of the minimum.
Furthermore, the Exponential distribution has a rate parameter \(\lambda\) that is also the hazard (rate per unit time) of the event happening. This means the density over the survival function is constant:
\[ \frac{f(t)}{\Pr(X > t)} = \frac{\lambda e^{-\lambda t}}{e^{-\lambda t}} = \lambda. \]
Notice that at any split, the chance of selecting the node \(A^\ast\) that contains \(C_{xz}\) is proportional to its linear size \(r(A^\ast)\), while the total linear size over all current terminal nodes is
\[ R = \sum_{\text{all leaves } A} r(A). \]
The probability of selecting \(A^\ast\) is therefore \(r(A^\ast)/R\). Then, given that \(A^\ast\) is selected, the probability that the cut, regardless of the dimension, falls between \(x\) and \(z\) is \(\| x - z \|_1 / r(A^\ast)\). These two choices are independent, hence the probability that the next split separates \(x\) and \(z\) is
\[ \Pr(\text{next split separates } x, z) = \frac{r(A^\ast)}{R} \cdot \frac{\|x-z\|_1}{r(A^\ast)} = \frac{s}{R}. \]
However, noting that exponential clocks drive the selection of the next split, we can instead look at the rate of the event that the next split separates \(x\) and \(z\). For this, it is helpful to define two independent clocks:

- \(T\): the time at which \(x\) and \(z\) are first separated, i.e., the first time a cut lands inside \(C_{xz}\);
- \(C\): the time of the next split that does not separate them.

Conditioning on the current tree at time \(t\), we can consider a small interval of length \(\Delta t\) and investigate the rate at which a separation occurs:
\[ \begin{aligned} & \lim_{\Delta t \rightarrow 0} \frac{1}{\Delta t} \Pr( T \, \text{ happens between }(t,t+\Delta t) \mid \text{Current Tree} ) \\ =& \, \lim_{\Delta t \rightarrow 0} \frac{1}{\Delta t} \Pr( T \text{ or } C \, \text{ happens between }(t,t+\Delta t) | \text{Current Tree} ) \\ & \qquad \quad \times \Pr(T \text{ happens before } C \mid T \text{ or } C \in (t, t + \Delta t) \text{ with Current Tree } )\\ =& \, \lim_{\Delta t \rightarrow 0} \frac{1}{\Delta t} \Pr\big( \text{Exp}(R) \in (t, t + \Delta t) \big) \cdot \frac{s}{R} \\ =& \, \lim_{\Delta t \rightarrow 0} \frac{ R (\Delta t + o(\Delta t)) }{\Delta t} \times \frac{s}{R} \\ =& \, s . \end{aligned} \]
Hence the instantaneous rate (per unit time) of the separation event \(T\) is the constant \(s\), independent of \(R\), and it remains \(s\) as the rest of the tree keeps evolving through non-separating splits. This is a standard survival analysis argument: a waiting time with constant hazard \(s\) is exactly an exponential random variable with rate \(s\). Thus, the probability that \(x\) and \(z\) are not separated before time \(\lambda\) is
\[ K_\infty(x,z) = \Pr(T > \lambda) = e^{-\lambda s} = e^{-\lambda \|x-z\|_1}. \]
This shows that the infinite Mondrian forest kernel is exactly the Laplace kernel with parameter \(\lambda\).
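This can also be verified numerically. The sketch below (numpy assumed; the function name and parameters are made up) tracks only the cell containing both \(x\) and \(z\): by memorylessness, the waiting time until the next cut of that cell is exponential with rate equal to its current linear size, and the simulation stops either when a cut lands inside \(C_{xz}\) or when the lifetime \(\lambda\) is exhausted.

```python
import numpy as np

def same_cell_mondrian(x, z, lam, rng):
    """Return 1 if x and z are still in the same cell of a Mondrian partition of
    [0,1]^p run up to lifetime lam; only the cell containing both points is tracked."""
    lo, hi = np.zeros_like(x), np.ones_like(x)
    t = 0.0
    while True:
        rate = float(np.sum(hi - lo))                # linear size r(A) of the current cell
        t += rng.exponential(1.0 / rate)             # waiting time for the next cut of this cell
        if t > lam:
            return 1.0                               # lifetime exhausted, never separated
        j = rng.choice(len(x), p=(hi - lo) / rate)   # split coordinate chosen ∝ side length
        cut = rng.uniform(lo[j], hi[j])              # uniform cut position on that side
        if min(x[j], z[j]) < cut <= max(x[j], z[j]):
            return 0.0                               # the cut falls inside C_{xz}
        if cut <= min(x[j], z[j]):                   # otherwise keep the side holding both points
            lo[j] = cut
        else:
            hi[j] = cut

rng = np.random.default_rng(3)
p, lam = 3, 2.0
x = rng.uniform(size=p)
z = np.clip(x + rng.uniform(-0.2, 0.2, size=p), 0, 1)
est = np.mean([same_cell_mondrian(x, z, lam, rng) for _ in range(20000)])
print(est, np.exp(-lam * np.sum(np.abs(x - z))))     # should agree up to Monte Carlo error
```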
Typically, because of all the randomness involved in a random forest, it is difficult to obtain exact theoretical rates of convergence. However, under some simplifications, the existing literature provides useful theoretical understanding. For example, Scornet (2016) considered a version called the centered KeRF (Kernel based on Random Forests) and provided a convergence rate. The centered KeRF uses the following mechanism to grow each tree:

- At each internal node, a splitting coordinate is chosen uniformly at random from \(\{1, \ldots, p\}\), independently of the data.
- The node is split at its midpoint (center) along the chosen coordinate.
- Every node at a given level is split, and the tree is grown to a fixed depth of order \(\log_2 k\), producing roughly \(k\) terminal nodes.
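As a minimal sketch of this splitting scheme (numpy assumed; the function name and parameters are made up), the indicator of whether \(x\) and \(z\) share a leaf can be simulated by tracking only the cell containing \(x\): after \(k_j\) midpoint splits on coordinate \(j\), that cell is the dyadic interval of length \(2^{-k_j}\) containing \(x_j\). Averaging this indicator over many trees gives the centered KeRF kernel value at \((x, z)\).

```python
import numpy as np

def same_leaf_centered(x, z, depth, rng):
    """One centered tree of the given depth: follow the cell containing x, splitting a
    uniformly chosen coordinate at its midpoint; check whether z ends in the same cell."""
    k = np.zeros(len(x), dtype=int)              # number of midpoint splits per coordinate
    for _ in range(depth):
        k[rng.integers(len(x))] += 1
    # after k_j midpoint splits, the cell is the dyadic interval of length 2^{-k_j} containing x_j
    return float(np.all(np.floor(x * 2.0 ** k) == np.floor(z * 2.0 ** k)))

rng = np.random.default_rng(4)
p, depth = 3, 8                                  # depth ~ log2(number of leaves)
x = rng.uniform(size=p)
z = np.clip(x + rng.uniform(-0.1, 0.1, size=p), 0, 1)
print(np.mean([same_leaf_centered(x, z, depth, rng) for _ in range(20000)]))
```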
Under this setting, the infinite-forest estimator \(\hat{m}_{\infty,n}\) achieves the rate
\[ \E[(\hat{m}_{\infty,n}(x) - m(x))^2] \le C (\log n)^2 \, n^{-\frac{1}{3 + p \log 2}} \]
This is significantly slower than the classical minimax rate \(n^{-2/(p+2)}\) for Lipschitz functions under similar regularity conditions. However, this rate is obtained under a very simplified setting. In practice, random forests are known to adapt to low-dimensional structure (e.g., sparse signals) and can perform considerably better than other generic nonparametric methods in such settings.
In the original paper of Breiman (2000), the author used a slightly different algorithm to grow the tree. Instead of rotating through the coordinates, at each split the coordinate is selected uniformly at random from \(\{1,\dots,p\}\). This makes the analysis slightly more complicated, since one must account for the binomial distribution of the number of cuts on each coordinate. Here we use a cyclic coordinate sequence to simplify the analysis.↩︎
This is a very rough upper bound: as more cuts are made, the node shrinks, so the gap \(d\) occupies a larger fraction of its remaining length and the chance of avoiding separation at each subsequent cut is actually smaller than \(1-d\). The true probability after \(k\) cuts is therefore even smaller than \((1-d)^k\), but this rough bound gives a nice closed form.↩︎