\(\newcommand{\ci}{\perp\!\!\!\perp}\) \(\newcommand{\cA}{\mathcal{A}}\) \(\newcommand{\cB}{\mathcal{B}}\) \(\newcommand{\cC}{\mathcal{C}}\) \(\newcommand{\cD}{\mathcal{D}}\) \(\newcommand{\cE}{\mathcal{E}}\) \(\newcommand{\cF}{\mathcal{F}}\) \(\newcommand{\cG}{\mathcal{G}}\) \(\newcommand{\cH}{\mathcal{H}}\) \(\newcommand{\cI}{\mathcal{I}}\) \(\newcommand{\cJ}{\mathcal{J}}\) \(\newcommand{\cK}{\mathcal{K}}\) \(\newcommand{\cL}{\mathcal{L}}\) \(\newcommand{\cM}{\mathcal{M}}\) \(\newcommand{\cN}{\mathcal{N}}\) \(\newcommand{\cO}{\mathcal{O}}\) \(\newcommand{\cP}{\mathcal{P}}\) \(\newcommand{\cQ}{\mathcal{Q}}\) \(\newcommand{\cR}{\mathcal{R}}\) \(\newcommand{\cS}{\mathcal{S}}\) \(\newcommand{\cT}{\mathcal{T}}\) \(\newcommand{\cU}{\mathcal{U}}\) \(\newcommand{\cV}{\mathcal{V}}\) \(\newcommand{\cW}{\mathcal{W}}\) \(\newcommand{\cX}{\mathcal{X}}\) \(\newcommand{\cY}{\mathcal{Y}}\) \(\newcommand{\cZ}{\mathcal{Z}}\) \(\newcommand{\bA}{\mathbf{A}}\) \(\newcommand{\bB}{\mathbf{B}}\) \(\newcommand{\bC}{\mathbf{C}}\) \(\newcommand{\bD}{\mathbf{D}}\) \(\newcommand{\bE}{\mathbf{E}}\) \(\newcommand{\bF}{\mathbf{F}}\) \(\newcommand{\bG}{\mathbf{G}}\) \(\newcommand{\bH}{\mathbf{H}}\) \(\newcommand{\bI}{\mathbf{I}}\) \(\newcommand{\bJ}{\mathbf{J}}\) \(\newcommand{\bK}{\mathbf{K}}\) \(\newcommand{\bL}{\mathbf{L}}\) \(\newcommand{\bM}{\mathbf{M}}\) \(\newcommand{\bN}{\mathbf{N}}\) \(\newcommand{\bO}{\mathbf{O}}\) \(\newcommand{\bP}{\mathbf{P}}\) \(\newcommand{\bQ}{\mathbf{Q}}\) \(\newcommand{\bR}{\mathbf{R}}\) \(\newcommand{\bS}{\mathbf{S}}\) \(\newcommand{\bT}{\mathbf{T}}\) \(\newcommand{\bU}{\mathbf{U}}\) \(\newcommand{\bV}{\mathbf{V}}\) \(\newcommand{\bW}{\mathbf{W}}\) \(\newcommand{\bX}{\mathbf{X}}\) \(\newcommand{\bY}{\mathbf{Y}}\) \(\newcommand{\bZ}{\mathbf{Z}}\) \(\newcommand{\ba}{\mathbf{a}}\) \(\newcommand{\bb}{\mathbf{b}}\) \(\newcommand{\bc}{\mathbf{c}}\) \(\newcommand{\bd}{\mathbf{d}}\) \(\newcommand{\be}{\mathbf{e}}\) \(\newcommand{\bg}{\mathbf{g}}\) \(\newcommand{\bh}{\mathbf{h}}\) \(\newcommand{\bi}{\mathbf{i}}\) \(\newcommand{\bj}{\mathbf{j}}\) \(\newcommand{\bk}{\mathbf{k}}\) \(\newcommand{\bl}{\mathbf{l}}\) \(\newcommand{\bm}{\mathbf{m}}\) \(\newcommand{\bn}{\mathbf{n}}\) \(\newcommand{\bo}{\mathbf{o}}\) \(\newcommand{\bp}{\mathbf{p}}\) \(\newcommand{\bq}{\mathbf{q}}\) \(\newcommand{\br}{\mathbf{r}}\) \(\newcommand{\bs}{\mathbf{s}}\) \(\newcommand{\bt}{\mathbf{t}}\) \(\newcommand{\bu}{\mathbf{u}}\) \(\newcommand{\bv}{\mathbf{v}}\) \(\newcommand{\bw}{\mathbf{w}}\) \(\newcommand{\bx}{\mathbf{x}}\) \(\newcommand{\by}{\mathbf{y}}\) \(\newcommand{\bz}{\mathbf{z}}\) \(\newcommand{\RR}{\mathbb{R}}\) \(\newcommand{\NN}{\mathbb{N}}\) \(\newcommand{\balpha}{\boldsymbol{\alpha}}\) \(\newcommand{\bbeta}{\boldsymbol{\beta}}\) \(\newcommand{\btheta}{\boldsymbol{\theta}}\) \(\newcommand{\hpi}{\widehat{\pi}}\) \(\newcommand{\bpi}{\boldsymbol{\pi}}\) \(\newcommand{\hbpi}{\widehat{\boldsymbol{\pi}}}\) \(\newcommand{\bxi}{\boldsymbol{\xi}}\) \(\newcommand{\bmu}{\boldsymbol{\mu}}\) \(\newcommand{\bepsilon}{\boldsymbol{\epsilon}}\) \(\newcommand{\bzero}{\mathbf{0}}\) \(\newcommand{\T}{\text{T}}\) \(\newcommand{\Trace}{\text{Trace}}\) \(\newcommand{\Cov}{\text{Cov}}\) \(\newcommand{\Var}{\text{Var}}\) \(\newcommand{\E}{\mathbb{E}}\) \(\newcommand{\Pr}{\text{Pr}}\) \(\newcommand{\pr}{\text{pr}}\) \(\newcommand{\pdf}{\text{pdf}}\) \(\newcommand{\P}{\text{P}}\) \(\newcommand{\p}{\text{p}}\) \(\newcommand{\One}{\mathbf{1}}\) \(\newcommand{\argmin}{\operatorname*{arg\,min}}\) 
\(\newcommand{\argmax}{\operatorname*{arg\,max}}\) \(\newcommand{\dtheta}{\frac{\partial}{\partial\theta} }\) \(\newcommand{\ptheta}{\nabla_\theta}\) \(\newcommand{\alert}[1]{\color{darkorange}{#1}}\) \(\newcommand{\alertr}[1]{\color{red}{#1}}\) \(\newcommand{\alertb}[1]{\color{blue}{#1}}\)

1 Regression Random Forests Recap

Random forests extend decision trees through ensembling, and in their classical regression form they are designed to estimate the conditional mean function

\[ m(x) = \mathbb{E}[Y \mid X = x], \]

given training data \(\{(X_i, Y_i)\}_{i=1}^n\). A regression tree partitions the covariate space \(\mathcal{X}\) into \(M\) disjoint terminal nodes \(A_1, \dots, A_M\). For a new input \(x \in A_m\), the tree prediction is the sample average of outcomes in that region,

\[ \hat f_{A_m} = \frac{1}{|A_m|} \sum_{i: X_i \in A_m} Y_i, \]

with trees fitted on randomly sampled subsets of data and features. A random forest then averages a large number of such trees to reduce variance and improve stability. We have shown previously that the forest prediction is also a kernel method. More importantly, the essential component we will examine in this lecture is the splitting rule used in tree construction, which in a regression problem is the reduction of variance:

\[ \begin{aligned} &\Var(A) - \left( \frac{|A_L|}{|A|} \Var(A_L) + \frac{|A_R|}{|A|} \Var(A_R) \right)\\ =& \frac{1}{|A|}\sum_{i: X_i \in A} (Y_i - \bar Y_A)^2 - \left( \frac{|A_L|}{|A|} \frac{ \sum_{i: X_i \in A_L} (Y_i - \bar Y_{A_L})^2 }{|A_L|} + \frac{|A_R|}{|A|} \frac{ \sum_{i: X_i \in A_R} (Y_i - \bar Y_{A_R})^2 }{|A_R|} \right) \\ =& \frac{1}{|A|} \left[ \sum_{i: X_i \in A} (Y_i - \bar Y_A)^2 - \left( \sum_{i: X_i \in A_L} (Y_i - \bar Y_{A_L})^2 + \sum_{i: X_i \in A_R} (Y_i - \bar Y_{A_R})^2 \right) \right], \end{aligned} \]

which, up to the common factor \(1/|A|\), corresponds to the ANOVA decomposition of the total sum of squares. This splitting rule is intuitive and effective for estimating conditional means, but it does not generalize to other parameters of interest. The generalized random forest (GRF) framework proposed by Athey, Tibshirani, and Wager (2019) overcomes this limitation by replacing the variance reduction splitting rule with a more general and theoretically justified criterion based on influence functions.
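
To make the splitting rule concrete, here is a minimal sketch in R (a toy exhaustive search over one variable; real forest implementations are heavily optimized) that scores a candidate split by its reduction in the total sum of squares:

  # Sum of squared deviations within a node
  sse <- function(y) sum((y - mean(y))^2)
  
  # Gain of splitting the node at (variable j, cutpoint):
  # parent SSE minus the sum of the two child SSEs
  split_gain <- function(X, y, j, cutpoint) {
    left <- X[, j] <= cutpoint
    if (all(left) || !any(left)) return(-Inf)  # invalid split: an empty child
    sse(y) - (sse(y[left]) + sse(y[!left]))
  }
  
  # Toy example: the best cutpoint on variable 1 should be near 0.5
  set.seed(1)
  X <- matrix(runif(200 * 2), 200, 2)
  y <- (X[, 1] > 0.5) + rnorm(200, sd = 0.3)
  cuts <- sort(unique(X[, 1]))
  gains <- sapply(cuts, function(cc) split_gain(X, y, 1, cc))
  cuts[which.max(gains)]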

2 Influence Function

The influence function is a fundamental concept in statistics. It belongs primarily to semiparametric and robust statistics, but here we demonstrate it in a parametric setting, where it connects easily with the score function, Fisher information, and M-estimators. The influence function provides a way to understand the robustness and efficiency of an estimator: it measures the sensitivity of an estimator to small changes in the data distribution, which in turn allows us to analyze the estimator's asymptotic behavior. Let \(P\) denote the distribution of a random variable \(Y\); a statistical parameter can then often be written as a functional of \(P\). For example,

  • Mean: \(T(P) = \mathbb{E}_P[Y] = \int y \, dP(y)\);
  • Variance: \(T(P) = \Var_P(Y)\);
  • Quantile: \(T(P) = F_P^{-1}(\tau)\).

Suppose we slightly perturb \(P\) by introducing a small amount of point mass at an observation \(y\). This new perturbed distribution can be expressed as

\[ P_\varepsilon = (1 - \varepsilon) P + \varepsilon \, \delta_y, \]

where \(\delta_y\) is the Dirac measure at \(y\) and \(\varepsilon > 0\) is small. The influence function then examines how the statistical parameter \(T(P)\) changes as we vary \(\varepsilon\):

\[ \begin{aligned} \text{IF}(y; T, P) &= \lim_{\varepsilon \to 0} \frac{T(P_\varepsilon) - T(P)}{\varepsilon} \\ &= \left. \frac{d}{d\varepsilon} T\big( \varepsilon \, \delta_y + (1 - \varepsilon) P \big) \right|_{\varepsilon = 0} \end{aligned} \]

which is the Gateaux derivative of the functional \(T\) at \(P\) in the direction of \(\delta_y - P\). You may think of the distribution \(P\) as an (infinite-dimensional) vector of probabilities assigned to each possible observation, and the influence function measures the sensitivity of the functional \(T\) to a small change in this vector. This derivative quantifies the infinitesimal effect that adding a small mass at \(y\) has on the parameter estimate. With this understanding, we can construct a first-order functional approximation of the estimator \(T\) at its truth \(T(P)\):

\[ \begin{aligned} T(P_\varepsilon) &= T(P) + \int \text{IF}(y; T, P) d(P_\varepsilon - P)(y) + R(\varepsilon) \\ &= T(P) + \int \text{IF}(y; T, P) d( \varepsilon \delta_y + (1 - \varepsilon) P - P)(y) + R(\varepsilon) \\ &= T(P) + \varepsilon \int \text{IF}(y; T, P) d(\delta_y - P)(y) + R(\varepsilon) \end{aligned} \]

where \(R(\varepsilon)\) is a remainder term that goes to zero faster than \(\varepsilon\) as \(\varepsilon \to 0\). With this understanding, we can analyze the estimator \(T\) on the observed empirical distribution \(P_n\) based on a sample \(Y_1, \dots, Y_n\), i.e.,

\[ P_n = \frac{1}{n} \sum_{i=1}^n \delta_{Y_i}. \]

Notice that the difference between the empirical distribution \(P_n\) and the true distribution \(P\) can be expressed as

\[ P_n - P = \frac{1}{n} \sum_{i=1}^n (\delta_{Y_i} - P) = \sum_{i=1}^n \varepsilon_i (\delta_{Y_i} - P) \]

with each \(\varepsilon_i = 1/n\). The difference between \(P_n\) and \(P\) can thus be viewed as a sum of small perturbations, each adding a small mass \(\varepsilon_i\) at the observed data point \(Y_i\) and removing the same amount of mass from \(P\) proportionally. This allows us to apply the functional Taylor expansion to approximate \(T(P_n)\):

\[ \begin{aligned} T(P_n) &= T(P) + \int \text{IF}(y; T, P) d(P_n - P)(y) + R(P_n - P) \\ &= T(P) + \int \text{IF}(y; T, P) d P_n(y) - \int \text{IF}(y; T, P) d P(y) + R(P_n - P) \\ &= T(P) + \frac{1}{n} \sum_{i=1}^n \text{IF}(Y_i; T, P) - \int \text{IF}(y; T, P) dP(y) + R(P_n - P) \\ &= T(P) + \frac{1}{n} \sum_{i=1}^n \text{IF}(Y_i; T, P) + R(P_n - P) \end{aligned} \]

with the understanding that the integral of the influence function with respect to the distribution \(P\) is zero¹. Hence this decomposition shows that the error of the estimator \(T(P_n) - T(P)\) can be approximated by the average of the influence functions evaluated at the observed data points, plus a small remainder term that vanishes faster than the \(n^{-1/2}\) rate.

\[ T(P_n) - T(P) \approx \frac{1}{n} \sum_{i=1}^n \text{IF}(Y_i; T, P) \]

Note that the right hand side is an average of i.i.d. terms, which is crucial for establishing the asymptotic normality of the estimator via the central limit theorem. Since these terms have mean 0 and are independent, the variance of the estimator can be estimated by the average of the squared influence functions, i.e.,

\[ \widehat{\Var}(T(P_n)) = \frac{1}{n^2} \sum_{i=1}^n \widehat{\text{IF}}^2(Y_i; T, P). \]

2.1 Example: Mean Estimation

Let’s take a look at an example of the influence function for the mean estimation. For the mean, we have \(\theta = T(P) = \mathbb{E}_P[Y]\). Consider \(P_\varepsilon = (1-\varepsilon)P + \varepsilon \delta_y\), i.e., putting a small point mass at \(y\). Then

\[ T(P_\varepsilon) = (1-\varepsilon)\mathbb{E}_P[Y] + \varepsilon y \]

And we can calculate the influence function as

\[ \begin{aligned} \text{IF}(y;T,P) &= \left.\frac{d}{d\varepsilon}T(P_\varepsilon)\right|_{\varepsilon=0}\\ &= y - \mathbb{E}_P[Y]. \end{aligned} \]

Note that \(\int \text{IF}(y;T,P)\, dP(y) = \mathbb{E}_P[Y] - \mathbb{E}_P[Y] = 0\) confirms our earlier claim. Therefore, the influence function for the mean functional is simply the deviation of the observation \(y\) from the population mean \(\mathbb{E}_P[Y]\). Empirically, to calculate the influence function at each data point \(Y_i\), we can replace the population mean with the sample mean \(\bar{Y}\):

\[ \text{IF}(Y_i;T,P) = Y_i - \bar{Y}. \]

To obtain the variance, note that \(\Var(\text{IF}(Y_i;T,P)) = \Var(Y_i - \bar{Y})\), which can be approximated with

\[ \widehat{\Var}(\text{IF}(Y_i;T,P)) = \frac{1}{n} \sum_{i=1}^n (Y_i - \bar{Y})^2, \]

which is exactly the sample variance (with the \(1/n\) normalization). Plugging this into the variance formula above recovers the familiar \(\widehat{\Var}(\bar{Y}) = \frac{1}{n^2}\sum_{i=1}^n (Y_i - \bar{Y})^2\).
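
As a quick numerical sanity check (a standalone sketch, not tied to any package), the influence-function variance estimate for the sample mean essentially reproduces the textbook \(s^2/n\):

  set.seed(42)
  y <- rexp(1000)                  # any distribution will do
  if_vals <- y - mean(y)           # empirical influence function of the mean
  sum(if_vals^2) / length(y)^2     # IF-based variance estimate of the mean
  var(y) / length(y)               # usual s^2/n; differs only by a factor (n-1)/n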

3 Generalized Random Forests

The idea of the splitting rule in generalized random forests is to use the influence function at each node to guide the splitting, instead of variance reduction on the raw outcomes. This is typically a topic in semiparametric models, but here we will only give a brief introduction. The key idea is to consider a more general parameter of interest \(\theta\) defined through a moment condition rather than just the mean, where \(\psi\) denotes the score function corresponding to the parameter of interest:

\[ \mathbb{E}[\psi(Y; \theta)] = 0. \]

At each internal node, we have a subset of data points \(\{(X_i, Y_i): X_i \in A\}\) corresponding to the region \(A\). Treating the data in this node as i.i.d. samples from the same distribution, we can compute the influence function of each observation with respect to the parameter of interest. We then use these influence function values to perform a variance-reduction type of splitting, just as in the classical regression tree. Why does this make sense? Recall that the influence function measures the sensitivity of the estimator to small changes in the data distribution. By splitting the node so as to maximize the difference in average influence function values between child nodes, we effectively put observations that have a positive influence on the parameter estimate on one side of the split, and those with a negative influence on the other. This leads to child nodes that are more homogeneous in their contribution to the parameter estimate, which reduces the variance of the estimator within each node.
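
Schematically, one GRF-style split could be carried out as in the sketch below. This is only a conceptual illustration: the actual grf package uses a gradient-based approximation and many computational shortcuts, and `node_if` here is a hypothetical user-supplied function returning the influence value of each observation in the node.

  # One GRF-style split: (1) compute influence values in the node,
  # (2) run a variance-reduction split on them as pseudo-responses.
  grf_split <- function(X, y, node_if) {
    rho <- node_if(y)  # pseudo-responses (influence values)
    best <- list(gain = -Inf, j = NA, cut = NA)
    for (j in seq_len(ncol(X))) {
      for (cutpoint in sort(unique(X[, j]))[-1]) {  # skip smallest value so no child is empty
        left <- X[, j] < cutpoint
        gain <- sum((rho - mean(rho))^2) -
          sum((rho[left] - mean(rho[left]))^2) -
          sum((rho[!left] - mean(rho[!left]))^2)
        if (gain > best$gain) best <- list(gain = gain, j = j, cut = cutpoint)
      }
    }
    best
  }
  
  # For the conditional mean, the influence value is the centered outcome,
  # so this reduces to the classical regression-tree split:
  mean_if <- function(y) y - mean(y)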

Let’s consider a regression problem where we observe \(\{(X_i, Y_i)\}_{i=1}^n\). The parameter of interest is the conditional mean \(m(x) = \E( Y \mid X=x)\). At each internal node, however, we simply treat all observations as i.i.d. and estimate the node parameter \(\theta\). After calculating the influence function for each observation in the node, we use these influence function values as pseudo-responses to perform a variance-reduction type of splitting. For the regression case, since the influence function is the centered \(Y_i\) value within the node, the splitting criterion becomes exactly the same as the classical regression tree splitting rule. But we could look at some other examples:

3.1 Binary Outcome

Suppose \(Y \in \{0, 1\}\) and we are interested in estimating the conditional probability \(\theta(x) = \mathbb{P}(Y=1 \mid X=x)\). Treating the data within a given node \(A\) as i.i.d., the moment condition is

\[ \mathbb{E}[Y - \theta \mid X \in A] = 0. \]

To calculate the influence function, we consider the perturbed distribution \(P_\varepsilon = (1-\varepsilon)P + \varepsilon \delta_y\). Then following our definition of the influence function, we have

\[ T(P_\varepsilon) = (1-\varepsilon)\mathbb{E}_P[Y] + \varepsilon y = (1-\varepsilon)\theta + \varepsilon y \]

and

\[ \text{IF}(y;T,P) = \left.\frac{d}{d\varepsilon}T(P_\varepsilon)\right|_{\varepsilon=0} = y - \theta. \]

Empirically, we can estimate the influence function at each data point \(Y_i\) in the node by replacing \(\theta\) with the sample proportion \(\hat{\theta} = \frac{1}{|A|} \sum_{i: X_i \in A} Y_i\):

\[ \text{IF}(Y_i;T,P) = Y_i - \hat{\theta}. \]

Hence, we can see that we are not using the Gini index or entropy as the splitting criterion, but rather variance reduction based on these pseudo-responses \(Y_i - \hat{\theta}\).
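
Note that within a node, \(\hat\theta\) is a constant, so the pseudo-responses are just a shifted copy of the \(Y_i\)'s, and variance reduction is shift-invariant. A short standalone check confirms that the split gains are identical:

  set.seed(5)
  y <- rbinom(50, 1, 0.3)
  theta_hat <- mean(y)                                # node-level estimate
  left <- sample(c(TRUE, FALSE), 50, replace = TRUE)  # an arbitrary candidate split
  sse <- function(v) sum((v - mean(v))^2)
  gain_raw    <- sse(y) - sse(y[left]) - sse(y[!left])
  gain_pseudo <- sse(y - theta_hat) - sse(y[left] - theta_hat) - sse(y[!left] - theta_hat)
  all.equal(gain_raw, gain_pseudo)                    # TRUE: centering changes nothing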

3.2 Treatment Effect Estimation

Suppose we have a binary treatment \(W \in \{0, 1\}\) (we write \(W\) rather than \(A\) to avoid a clash with the node notation) and an outcome \(Y\). We are interested in estimating the conditional average treatment effect (CATE) \(\tau(x) = \mathbb{E}[Y(1) - Y(0) \mid X=x]\). Under the unconfoundedness assumption, we can identify the CATE using the following moment condition:

\[ \mathbb{E}\left[ \frac{W - e(X)}{e(X)(1 - e(X))} (Y - m(X)) - \tau(X) \mid X \right] = 0, \]

where \(e(X) = \mathbb{P}(W=1 \mid X)\) is the propensity score and \(m(X) = \mathbb{E}[Y \mid X]\) is the outcome regression. At each internal node, we can estimate the nuisance functions \(e(X)\) and \(m(X)\) using the data in that node, treating them as i.i.d. samples. The influence function for the treatment effect parameter can be derived similarly by considering the perturbed distribution and calculating the derivative. The influence function for each observation \((Y_i, W_i, X_i)\) in the node is given by

\[ \text{IF}(Y_i, W_i, X_i; \tau, P) = \frac{W_i - \hat{e}(X_i)}{\hat{e}(X_i)(1 - \hat{e}(X_i))} (Y_i - \hat{m}(X_i)) - \hat{\tau}, \]

where \(\hat{e}(X_i)\), \(\hat{m}(X_i)\), and \(\hat{\tau}\) are the estimates of the propensity score, outcome regression, and treatment effect in the node, respectively. We then use these influence function values as pseudo-responses to perform a variance-reduction type of splitting, similar to the previous examples. One needs to be careful, however: the dependence between the estimated nuisance functions (\(e\) and \(m\)) and the \(Y_i\)'s complicates the theory. This is particularly important in causal inference settings for obtaining unbiased estimates of treatment effects. We will give more details on causal forests in later chapters.
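
As a sketch of how these pseudo-responses could be formed inside a node under the i.i.d.-within-node simplification above (with constant nuisance estimates; the actual grf causal forest instead centers \(Y\) and \(W\) using forest-based nuisance fits), consider:

  # Pseudo-responses for CATE splitting within a node (sketch).
  # y: outcomes in the node; w: binary treatment indicators in the node.
  cate_pseudo <- function(y, w) {
    e_hat <- mean(w)                                   # node propensity score
    m_hat <- mean(y)                                   # node outcome mean
    score <- (w - e_hat) / (e_hat * (1 - e_hat)) * (y - m_hat)
    score - mean(score)                                # subtract the node tau-hat
  }
  
  # These values would then serve as pseudo-responses in a
  # variance-reduction split, e.g., via the grf_split() sketch above.
  set.seed(7)
  w <- rbinom(100, 1, 0.5)
  y <- 1.5 * w + rnorm(100)
  rho <- cate_pseudo(y, w)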

3.3 A General Numerical Calculation

In general, we may not need to derive the influence function analytically, since the estimator we use could be complex. However, realizing that the influence function is essentially the (scaled) difference between the perturbed estimator and the original estimator, we can approximate it numerically. Removing observation \(Y_i\) corresponds to perturbing \(P_n\) in the direction of \(\delta_{Y_i}\) with \(\varepsilon = -1/(n-1)\), which gives the leave-one-out approximation

\[ \text{IF}(Y_i; T, P) \approx (n-1) \left( T(P_n) - T(P_n^{(-i)}) \right). \]

This means that to estimate the influence function at a data point \(Y_i\), we can compute the estimator \(T\) on the dataset without \(Y_i\) (i.e., leave-one-out) and compare it to the estimator on the full dataset. When constructing a tree, this approach can be restricted to the data points within a node, and it works for any parameter of interest. The resulting differences, regardless of their scale, can be used to find the splitting rule. This numerical approach is particularly useful when the estimator \(T\) is complex and we do not have a closed-form expression for its influence function.
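
For example, the following sketch computes leave-one-out influence values for the sample median, where no convenient closed form is at hand, and verifies the formula against the analytic influence function of the mean:

  # Leave-one-out numerical influence values:
  # IF(Y_i) ~ (n-1) * (T(P_n) - T(P_n^(-i)))
  loo_if <- function(y, stat) {
    n <- length(y)
    sapply(seq_len(n), function(i) (n - 1) * (stat(y) - stat(y[-i])))
  }
  
  set.seed(3)
  y <- rnorm(100)
  all.equal(loo_if(y, mean), y - mean(y))  # TRUE: matches the analytic IF of the mean
  if_med <- loo_if(y, median)              # numerical IF values for the median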

4 Forest Estimation and Variance

Realizing that a random forest is a kernel method, or weighted average method, we can express the forest estimator as solving for \(\theta(x)\) in the local moment condition

\[ \E[\psi(Y; \theta(x)) \mid X=x] = 0 \]

where \(\psi\) is the score function corresponding to the parameter of interest. To be precise, the score function is the (negative) derivative of the loss function with respect to the parameter.

  • When the loss function is the squared error loss, the score function is \(\psi(Y; \theta) = Y - \theta\), which gives our moment condition for the conditional mean estimation.
  • When the loss function is the negative log-likelihood of a Bernoulli distribution parameterized on the logit scale, the score function is \(\psi(Y; \theta) = Y - \text{logit}^{-1}(\theta)\), i.e., the residual \(Y - p(x)\). This gives our moment condition for the conditional probability estimation.

The locally weighted moment condition can be expressed as solving \(\widehat\theta(x)\) with

\[ \sum_{i=1}^n w_i(x) \psi(Y_i; \theta(x)) = 0 \]

where \(w_i(x)\) are the forest weights for each observation, which depend on the covariates \(X_i\) and the target point \(x\), and are properly normalized to sum to 1. Concretely, each tree assigns observation \(i\) a weight of one over the size of the terminal node containing \(x\) whenever \(X_i\) falls in that node (and zero otherwise), and these per-tree weights are averaged across all trees in the forest. Hence the weight reflects how much each observation contributes to the estimation of \(\theta(x)\), with observations closer to \(x\) typically receiving higher weights.
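
These weights can be inspected directly. The sketch below uses the grf function `get_forest_weights()` (available in recent versions of the package, to the best of my knowledge) to verify that, for a regression forest with \(\psi(Y;\theta) = Y - \theta\), the solution of the weighted moment condition is exactly the weighted average \(\widehat\theta(x) = \sum_i w_i(x) Y_i\):

  library(grf)
  
  set.seed(1)
  X <- matrix(rnorm(300 * 3), 300, 3)
  Y <- X[, 1]^2 + rnorm(300)
  rf <- regression_forest(X, Y)
  
  x0 <- matrix(0, 1, 3)                      # target point x
  w <- get_forest_weights(rf, newdata = x0)  # 1 x n matrix of forest weights
  sum(w)                                     # normalized: sums to 1
  
  # Solving sum_i w_i(x) (Y_i - theta) = 0 gives the weighted average,
  # which matches the forest prediction at x0
  c(as.numeric(w %*% Y), predict(rf, x0)$predictions)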

We can actually derive the asymptotic variance of the forest estimator \(\hat{\theta}(x)\) using the influence function. The key idea is to express the estimator as a weighted average of the influence functions evaluated at each observation:

\[ \begin{aligned} \mathbf{0} &= \sum_{i=1}^n w_i(x) \psi(Y_i; \widehat{\theta}(x)) \qquad \text{by definition of} \, \widehat\theta(x) \\ &= \sum_{i=1}^n w_i(x) \psi(Y_i; \theta(x)) + \sum_{i=1}^n w_i(x) \left[ \left. \frac{\partial}{\partial \theta} \psi(Y_i; \theta) \right|_{\theta = \theta(x)} \right] (\widehat{\theta}(x) - \theta(x)) + o_p(n^{-1/2}) \\ \end{aligned} \]

Think of this as an optimization problem with \(\widehat{\theta}\) lying close to \(\theta\) in a quadratic local approximation. Hence, if we define the Hessian matrix

\[ H(x) = \mathbb{E}\left[ \left. \frac{\partial}{\partial \theta} \psi(Y_i; \theta) \right|_{\theta = \theta(x)} \mid X = x \right], \]

This is the typical M-estimation (negative) Fisher information matrix, which we estimate with

\[ \widehat{H}(x) = \sum_{i=1}^n w_i(x) \left. \frac{\partial}{\partial \theta} \psi(Y_i; \theta) \right|_{\theta = \widehat{\theta}(x)}. \]

Then we can re-express \(\widehat{\theta}(x) - \theta(x)\) as

\[ \widehat{\theta}(x) - \theta(x) = - \widehat{H}(x)^{-1} \sum_{i=1}^n w_i(x) \psi(Y_i; \theta(x)) + o_p(n^{-1/2}). \]

Following our previous discussion on estimating the variance of an estimator using the influence function, we can see that the variance of our estimator is the weighted average of the squared scores, wrapped in the sandwich form:

\[ \widehat{\Var}(\widehat{\theta}(x)) = \widehat{H}(x)^{-1} \left( \sum_{i=1}^n w_i^2(x) \, \psi^2(Y_i; \widehat\theta(x)) \right) \widehat{H}(x)^{-1}, \]

where \(\psi(Y_i; \widehat\theta(x))\) is the estimated score for observation \(Y_i\) at the parameter estimate \(\widehat\theta(x)\) (in the multivariate case, the middle term becomes \(\sum_i w_i^2(x) \, \psi \psi^\T\)). A naive plug-in version of this formula is sketched after the list below. However, in practice, there are several challenges in estimating this variance:

  1. The weights \(w_i(x)\) are random and depend on the data, which makes the two terms \(w_i^2(x)\) and \(\psi(Y_i; \widehat\theta(x))\) correlated. This correlation can lead to biased variance estimates if not properly accounted for. In grf, this is addressed by the “honest” approach, where the data used to construct the trees is separate from the data used to estimate the parameters within the leaves.
  2. Furthermore, in the above equation, the influence function should be evaluated using the out-of-bag predictions of \(\hat\theta\) to ensure independence between the weights and the influence function values.
  3. The influence function \(\psi(Y_i; \widehat\theta(x))\) and the Hessian can be complex and difficult to compute, especially for non-standard parameters. In grf, this is handled by using jackknife (Wager, Hastie, and Efron 2014) or bootstrap (Efron 1982) methods to estimate the variance empirically. We shall discuss this in more detail in the next chapter.
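
Ignoring these corrections for the moment, a naive plug-in of the sandwich formula for a regression forest (where \(\psi(Y;\theta) = Y - \theta\), so \(\widehat H(x) = -1\)) might look like the sketch below. This is for intuition only and will generally differ from the corrected estimate that grf reports; as before, `get_forest_weights()` is an assumed API.

  library(grf)
  
  set.seed(2)
  X <- matrix(rnorm(500 * 3), 500, 3)
  Y <- X[, 1] + rnorm(500)
  rf <- regression_forest(X, Y)
  
  x0 <- matrix(0, 1, 3)
  w <- as.numeric(get_forest_weights(rf, newdata = x0))
  theta_hat <- sum(w * Y)
  
  # Naive sandwich: H = -1, so Var ~ sum_i w_i^2 * psi_i^2
  naive_var <- sum(w^2 * (Y - theta_hat)^2)
  c(naive = naive_var,
    grf = predict(rf, x0, estimate.variance = TRUE)$variance.estimates)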

5 Example: Using the grf Package

We can illustrate this using the grf package. Below is an estimation of the conditional average treatment effect (CATE) using a causal forest. By default, the package uses the honest approach to construct the trees, with half of the sample used to build the tree structure and the other half to estimate the leaf parameters, and it provides variance estimates for the CATE.

  library(grf)
  
  # Simulated data
  n <- 500
  p <- 5
  X <- matrix(rnorm(n * p), n, p)
  W <- rbinom(n, 1, 0.5)  
  tau <- 2 * X[,1] # the true CATE function
  
  Y <- X[,1] * 2 + W * tau + rnorm(n)

  par(mfrow=c(1,2))
  # plot data with two regression lines
  plot(X[,1], Y, col = ifelse(W==1, "deepskyblue", "darkorange"), pch=16, xlab="X1", ylab="Y")
  abline(lm(Y[W==1] ~ X[W==1,1]), col="deepskyblue", lwd=2)
  abline(lm(Y[W==0] ~ X[W==0,1]), col="darkorange", lwd=2) 
  
  # Fit a causal forest
  cf <- causal_forest(X, Y, W,
                      honesty = TRUE,
                      honesty.fraction = 0.5)
  
  # Estimate conditional treatment effects
  cf.pred <- predict(cf, estimate.variance = TRUE)
  tau_hat <- cf.pred$predictions
  
  # plot the estimated CATE against the true CATE
  plot(X[,1], tau_hat, pch=16, xlab="X1", ylab="Estimated CATE", col="black")
  points(X[,1], tau, col="red", pch=16, cex = 0.5)
  legend("bottomright", legend=c("Estimated CATE", "True CATE"), 
         col=c("black", "red"), pch=16, cex = 1.5)

  # get confidence intervals
  ci <- 1.96 * sqrt(cf.pred$variance.estimates)
  
  # plot the estimated CATE with confidence intervals for the first 20 observations
  # ordered by their true CATE values
  tau_hat_ci <- data.frame(tau = tau, 
                           tau_hat = tau_hat,
                           lower = tau_hat - ci,
                           upper = tau_hat + ci)[1:20, ]
  
  tau_hat_ci <- tau_hat_ci[order(tau_hat_ci$tau), ]
  
  plot(tau_hat_ci$tau, pch=19, xlab="Index (ordered by true CATE)", 
       ylab="Estimated CATE with 95% CI", 
       col = "red", ylim = c(min(tau_hat_ci), max(tau_hat_ci)), 
       main = "Estimated CATE with 95% CI vs True CATE")
  points(tau_hat_ci$tau_hat, pch=19, col = "black", cex = 0.7)
  segments(1:nrow(tau_hat_ci),
           tau_hat_ci$lower, 
           1:nrow(tau_hat_ci),
           tau_hat_ci$upper, 
           lwd = 1.5)
  legend("topleft", legend=c("True CATE", "Estimated CATE", "95% CI"), 
         col=c("red", "black", "black"), pch=c(19,19,NA), 
         lwd=c(NA,NA,1.5), cex = 1.2)


Athey, Susan, Julie Tibshirani, and Stefan Wager. 2019. “Generalized Random Forests.” The Annals of Statistics 47 (2): 1148–78.
Efron, Bradley. 1982. The Jackknife, the Bootstrap and Other Resampling Plans. SIAM.
Wager, Stefan, Trevor Hastie, and Bradley Efron. 2014. “Confidence Intervals for Random Forests: The Jackknife and the Infinitesimal Jackknife.” The Journal of Machine Learning Research 15 (1): 1625–51.

  1. A simple intuitive explanation is that when we move \(P\) in the direction of itself, i.e., \((1 - \varepsilon) P + \varepsilon P = P\), the distribution does not change at all, so on average the perturbations cannot change the value of our estimator. The full proof can be a bit technical; for the mean estimation it is easy to verify directly. For a more rigorous argument, we realize that \[ \begin{aligned} \int \text{IF}(y; T, P) dP(y) &= \int \frac{d}{d\varepsilon} T((1 - \varepsilon) P + \varepsilon \delta_y) \Big|_{\varepsilon = 0} dP(y) \\ &= \frac{d}{d\varepsilon} \int T((1 - \varepsilon) P + \varepsilon \delta_y) dP(y) \Big|_{\varepsilon = 0} \\ &= \frac{d}{d\varepsilon} \int T( P + \varepsilon (\delta_y - P) ) dP(y) \Big|_{\varepsilon = 0} \qquad (\ast) \end{aligned} \] For the term inside, we linearize around \(\varepsilon = 0\), assuming \(T\) is nicely behaved. This means we again break down \(T( P + \varepsilon (\delta_y - P) )\) into \(T(P)\) plus linear terms of influence functions: \[ T(P + \varepsilon (\delta_y - P) ) = T(P) + \varepsilon \int \text{IF}(u; T, P) d(\delta_y - P)(u) + R(\varepsilon) \] For the integral, we have \[ \int \text{IF}(u; T, P) d \delta_y(u) = \text{IF}(y; T, P) \] and \[ \int \text{IF}(u; T, P) d P(u) \] being exactly the quantity we want to prove is zero. Denote it by \(A\). Then, the integral in \((\ast)\) becomes \[ \begin{aligned} \int T( P + \varepsilon (\delta_y - P) ) dP(y) &= \int \left[ T(P) + \varepsilon (\text{IF}(y; T, P) - A) + R(\varepsilon) \right] dP(y)\\ &= T(P) + \varepsilon (A - A) + o(\varepsilon) \\ &= T(P) + o(\varepsilon) \end{aligned} \] Taking the derivative with respect to \(\varepsilon\) and evaluating at \(\varepsilon = 0\) gives zero, which completes the proof.↩︎