Fit models for regression, classification and survival
analysis using reinforced splitting rules. By default the
function fits a regular random forest; setting
\code{reinforcement = TRUE} activates an embedded model for
splitting variable selection and allows linear combination
splits. To specify parameters of the embedded model, see the
definition of \code{param.control}.
Usage
RLT(
x,
y,
censor = NULL,
model = NULL,
ntrees = if (reinforcement) 100 else 500,
mtry = max(1, as.integer(ncol(x)/3)),
nmin = max(1, as.integer(log(nrow(x)))),
split.gen = "random",
nsplit = 1,
resample.replace = TRUE,
resample.prob = if (resample.replace) 1 else 0.8,
resample.preset = NULL,
obs.w = NULL,
var.w = NULL,
importance = FALSE,
reinforcement = FALSE,
param.control = list(),
ncores = 0,
verbose = 0,
seed = NULL,
...
)
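As a minimal sketch of how the signature above might be called, the following fits a regular forest and then a reinforced one on simulated data. The data, the object names and the chosen values are illustrative assumptions; only the argument names come from this page, and the calls run only if an installed RLT exposes those arguments.

```r
# Simulated regression data (illustrative only)
set.seed(1)
n <- 200
p <- 10
x <- matrix(rnorm(n * p), n, p)
y <- x[, 1] + 0.5 * x[, 2] + rnorm(n)

# Embedded-model settings; the names are from the param.control list
# documented on this page, the values are arbitrary
embed_ctrl <- list(embed.ntrees = 50, embed.nmin = 10)

# Run the fits only if the installed RLT has the arguments documented here
has_documented_api <- requireNamespace("RLT", quietly = TRUE) &&
  all(c("param.control", "resample.preset") %in% names(formals(RLT::RLT)))

if (has_documented_api) {
  # Regular random forest (reinforcement = FALSE is the default)
  fit_rf <- RLT::RLT(x, y, model = "regression", ntrees = 100,
                     ncores = 1, seed = 1)

  # Reinforced splitting with custom embedded-model settings
  fit_rlt <- RLT::RLT(x, y, model = "regression", ntrees = 20,
                      reinforcement = TRUE, param.control = embed_ctrl,
                      ncores = 1, seed = 1)
}
```

Passing `ncores = 1` and a fixed `seed` here is only a choice for reproducible small runs, not a recommendation.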
Arguments
- x
A matrix or data.frame of features. If x is a data.frame, then all factors are treated as categorical variables, which will go through an exhaustive search of splitting criteria.
- y
Response variable: a numeric or factor vector.
- censor
Censoring indicator if survival model is used.
- model
The model type: "regression", "classification", "quantile", "survival" or "graph".
- ntrees
Number of trees. The default is ntrees = 100 if reinforcement is used and ntrees = 500 otherwise.
- mtry
Number of randomly selected variables used at each internal node.
- nmin
Terminal node size. Splitting stops when the internal node size is less than or equal to nmin.
- split.gen
How the cutting points are generated: "random", "rank" or "best". If a minimum child node size is enforced (alpha \(> 0\)), then "rank" or "best" should be used.
- nsplit
Number of random cutting points to compare for each variable at an internal node.
- resample.replace
Whether the in-bag samples are obtained with replacement.
- resample.prob
Proportion of in-bag samples.
- resample.preset
A pre-specified in-bag indicator/count matrix. It must be an \(n \times\) ntrees matrix with integer entries. A positive number indicates the number of copies of that observation (row) in the corresponding tree (column); zero indicates out-of-bag; a negative value indicates that the observation is used in neither. Extremely large counts should be avoided. The sum of each column should not exceed \(n\).
- obs.w
Observation weights. The weights are used for calculating the splitting scores, such as a weighted variance reduction or weighted Gini index, but they are not used for sampling observations. To control the sampling itself (balanced sampling, etc.), pre-specify resample.preset instead. For survival analysis, observation weights are not implemented in the "logrank" or "suplogrank" tests, due to the difficulty of calculating the variance of the test statistic; however, they are used in the "coxgrad" splitting rule. For other models, this feature is currently not available.
- var.w
Variable weights. If supplied, the default is to perform weighted sampling of the mtry variables. For other usage, see the details of split.rule in param.control.
- importance
Whether to calculate variable importance measures. When set to TRUE (or "permute"), the calculation follows Breiman's original permutation strategy. If set to "distribute", the out-of-bag data are sent to both child nodes with weights proportional to their sample sizes, so the final prediction is a weighted average over all terminal nodes that a perturbed observation could fall into. This feature is currently only available in regression and classification models.
- reinforcement
Whether the reinforcement splitting rule should be used. Default is FALSE, i.e., regular random forests with a marginal search of the splitting variable. When activated, an embedded model is fitted to find the best splitting variable or, if linear.comb \(> 1\), a linear combination of variables. These options can also be specified in param.control.
- param.control
A list of additional parameters. This can be used to specify other features in a random forest or to set embedded model parameters for reinforcement splitting rules. Using reinforcement = TRUE will automatically generate some default tuning for the embedded model; these defaults are not necessarily optimized. This mode is currently only available in regression.
embed.ntrees: number of trees in the embedded model
embed.mtry: number or proportion of variables
embed.nmin: terminal node size
embed.split.gen: random cutting point search method ("random", "rank" or "best")
embed.nsplit: number of random cutting points
embed.resample.replace: whether to sample with replacement
embed.resample.prob: proportion of samples (of the internal node) in the embedded model
embed.mute: muting rate
embed.protect: number of protected variables
embed.threshold: threshold, as a fraction of the best VI, for being included in the protected set at an internal node
\code{linear.comb} is a separate feature that can be activated with or without reinforcement. It creates a linear combination of features as the splitting rule. Currently only available for regression.
\itemize{
\item In reinforcement mode, a linear combination is created using the top continuous variables from the embedded model. If a categorical variable is the best, then a regular split is used. The splitting point is searched based on the \code{split.rule} of the model.
\item In non-reinforcement mode, a marginal screening is performed and the top features are used to construct the linear combination. This is an experimental feature.
}
\code{split.rule} specifies the criterion used to compare different splits. The available choices are listed below; the first one for each model type is the default:
\itemize{
\item Regression: `"var"` (variance reduction); `"pca"` and `"sir"` can be used for linear combination splits
\item Classification: `"gini"` (Gini index)
\item Survival: `"logrank"` (log-rank test), `"suplogrank"`, `"coxgrad"`
\item Quantile: `"ks"` (Kolmogorov-Smirnov test)
\item Graph: `"spectral"` (spectral embedding with variance reduction)
}
\code{resample.track} indicates whether to keep track of the observations used in each tree.
\code{var.ready} prepares the forest for calculating the variance (and hence confidence intervals) of the random forest prediction. Currently only available for regression (Xu, Zhu & Shao, 2023) and for confidence bands in survival models (Formentini, Liang & Zhu, 2022). Note that this only prepares the model fit so that it is ready for the calculation; to obtain the confidence intervals, see the prediction function. Specifying \code{var.ready = TRUE} has the following effects if these parameters are not already provided. For details of their restrictions, see the original papers.
\itemize{
\item \code{resample.preset} is constructed automatically
\item \code{resample.replace} is set to `FALSE`
\item \code{resample.prob} is set to \eqn{1 / 2}, i.e., \eqn{n / 2} in-bag samples per tree
\item \code{resample.track} is set to `TRUE`
}
It is recommended to use a very large \code{ntrees}, e.g., 10000 or larger. For \code{resample.prob} greater than \eqn{1 / 2} (more than \eqn{n / 2} in-bag samples), one should consider the bootstrap approach in Xu, Zhu & Shao (2023).
\code{alpha} forces a minimum proportion of samples (of the parent node) in each child node.
\code{failcount} specifies the number of unique failure time points used in a survival model. By default, all failure time points are used. A smaller number may speed up the computation. The time points are chosen uniformly on the quantiles of the failure times and always include the minimum and the maximum.
- ncores
Number of CPU logical cores. Default is 0 (using all available cores).
- verbose
Whether info should be printed.
- seed
Random seed number to replicate a previously fitted forest. Internally, the xoshiro256++ generator is used. If not specified, a seed will be generated automatically and recorded.
- ...
Additional arguments.
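The resample.preset description above can be made concrete with a small base-R sketch. It builds an \(n \times\) ntrees integer matrix in which 1 marks an in-bag observation, 0 marks out-of-bag, and -1 marks observations excluded from a tree altogether. The sizes, the held-out rows and the object names are illustrative assumptions; the final call runs only if an installed RLT exposes the arguments documented on this page.

```r
set.seed(42)
n <- 100
ntrees <- 50
k <- 40  # in-bag size per tree (arbitrary choice)

# Start with every observation out-of-bag (0) in every tree
preset <- matrix(0L, nrow = n, ncol = ntrees)

# Exclude the last 10 observations from all trees:
# a negative entry means neither in-bag nor out-of-bag
holdout <- 91:100
preset[holdout, ] <- -1L

# For each tree, draw k of the remaining observations without replacement
eligible <- setdiff(seq_len(n), holdout)
for (b in seq_len(ntrees)) {
  inbag <- sample(eligible, k)
  preset[inbag, b] <- 1L
}

# Number of in-bag copies per tree; each should not exceed n
in_bag_copies <- colSums(preset * (preset > 0))
stopifnot(is.integer(preset), all(in_bag_copies <= n))

# Pass the matrix to RLT only if the documented arguments are available
has_documented_api <- requireNamespace("RLT", quietly = TRUE) &&
  all(c("param.control", "resample.preset") %in% names(formals(RLT::RLT)))

if (has_documented_api) {
  x <- matrix(rnorm(n * 5), n, 5)
  y <- x[, 1] + rnorm(n)
  fit <- RLT::RLT(x, y, model = "regression", ntrees = ntrees,
                  resample.preset = preset, ncores = 1)
}
```

The same pattern could be adapted for balanced sampling by drawing the in-bag rows separately within each class.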
Value
An RLT fitted object, constructed as a list consisting of
- FittedForest
Fitted tree structures
- VarImp
Variable importance measures, if importance = TRUE
- Prediction
Out-of-bag prediction
- Error
Out-of-bag prediction error, adaptive to the model type
- ObsTrack
Provided if resample.track = TRUE, var.ready = TRUE, or if resample.preset was supplied. This is an \(n \times\) ntrees matrix that has the same meaning as resample.preset.
For classification forests, these items are further provided or will replace the regression version
- NClass
The number of classes
- Prob
Out-of-bag predicted probability
For survival forests, these items are further provided or will replace the regression version
- timepoints
Ordered observed failure times
- NFail
The number of observed failure times
- Prediction
Out-of-bag prediction of the hazard function
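As a rough sketch of how the returned list might be inspected, the hypothetical helper below relies only on the element names listed above. It assumes, without confirmation from this page, that for a regression forest Prediction holds one out-of-bag prediction per observation. The mock object is a stand-in for a fit and is not produced by RLT.

```r
# Hypothetical helper (not part of RLT): summarise a regression fit
# using only the documented element names
summarise_rlt <- function(fit, y) {
  stopifnot(is.list(fit), "Prediction" %in% names(fit))
  # Assumption: one out-of-bag prediction per observation
  pred <- as.numeric(fit$Prediction)
  list(
    elements     = names(fit),
    oob_mse      = mean((y - pred)^2, na.rm = TRUE),
    has_varimp   = !is.null(fit$VarImp),
    has_obstrack = !is.null(fit$ObsTrack)
  )
}

# Mock object standing in for a fitted forest (NOT real RLT output)
y_mock <- c(1, 2, 3, 4)
mock_fit <- list(FittedForest = list(),
                 Prediction = c(1.5, 2, 2.5, NA),
                 Error = NA)
s <- summarise_rlt(mock_fit, y_mock)

# With a real fit, run only if the documented arguments are available
has_documented_api <- requireNamespace("RLT", quietly = TRUE) &&
  all(c("param.control", "resample.preset") %in% names(formals(RLT::RLT)))

if (has_documented_api) {
  set.seed(1)
  x <- matrix(rnorm(200 * 5), 200, 5)
  y <- x[, 1] + rnorm(200)
  fit <- RLT::RLT(x, y, model = "regression", ntrees = 100,
                  importance = TRUE, ncores = 1)
  str(summarise_rlt(fit, y))
}
```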
References
Zhu, R., Zeng, D., & Kosorok, M. R. (2015) "Reinforcement Learning Trees." Journal of the American Statistical Association. 110(512), 1770-1784.
Xu, T., Zhu, R., & Shao, X. (2023) "On Variance Estimation of Random Forests with Infinite-Order U-statistics." arXiv preprint arXiv:2202.09008.
Formentini, S. E., Liang, W., & Zhu, R. (2022) "Confidence Band Estimation for Survival Random Forests." arXiv preprint arXiv:2204.12038.