Randomization as Regularization: A Degrees of Freedom Explanation for Random Forest Success

  • 2019-11-01 03:13:12
  • Lucas Mentch, Siyu Zhou
  • 14

Abstract

Random forests remain among the most popular off-the-shelf supervised machinelearning tools with a well-established track record of predictive accuracy inboth regression and classification settings. Despite their empirical success aswell as a bevy of recent work investigating their statistical properties, afull and satisfying explanation for their success has yet to be put forth. Herewe aim to take a step forward in this direction by demonstrating that theadditional randomness injected into individual trees serves as a form ofimplicit regularization, making random forests an ideal model in lowsignal-to-noise ratio (SNR) settings. Specifically, from a model-complexityperspective, we show that the mtry parameter in random forests serves much thesame purpose as the shrinkage penalty in explicitly regularized regressionprocedures like lasso and ridge regression. To highlight this point, we designa randomized linear-model-based forward selection procedure intended as ananalogue to tree-based random forests and demonstrate its surprisingly strongempirical performance. Numerous demonstrations on both real and synthetic dataare provided.

 

Quick Read (beta)

Randomization as Regularization: A Degrees of Freedom Explanation for Random Forest Success

Lucas Mentch and Siyu Zhou
 
Department of Statistics
University of Pittsburgh
Abstract

Random forests remain among the most popular off-the-shelf supervised machine learning tools with a well-established track record of predictive accuracy in both regression and classification settings. Despite their empirical success as well as a bevy of recent work investigating their statistical properties, a full and satisfying explanation for their success has yet to be put forth. Here we aim to take a step forward in this direction by demonstrating that the additional randomness injected into individual trees serves as a form of implicit regularization, making random forests an ideal model in low signal-to-noise ratio (SNR) settings. Specifically, from a model-complexity perspective, we show that the mtry parameter in random forests serves much the same purpose as the shrinkage penalty in explicitly regularized regression procedures like lasso and ridge regression. To highlight this point, we design a randomized linear-model-based forward selection procedure intended as an analogue to tree-based random forests and demonstrate its surprisingly strong empirical performance. Numerous demonstrations on both real and synthetic data are provided.


Keywords: Regularization, Bagging, Degrees of Freedom, Model Selection, Interpolation

1 Introduction

Despite being proposed nearly two decades ago, random forests (Breiman, 2001) remain among the most popular and successful off-the-shelf supervised learning methods with demonstrated success in numerous scientific domains from ecology (Prasad et al., 2006; Cutler et al., 2007; Coleman et al., 2017) to image recognition (Bernard et al., 2007; Huang et al., 2010; Guo et al., 2011; Fanelli et al., 2013) to bioinformatics (Díaz-Uriarte and De Andres, 2006; Mehrmohamadi et al., 2016). In comparison with many competing methods, random forests are generally quite competitive in terms of predictive accuracy despite being largely considered simpler and more computationally tractable. A recent large-scale empirical study by Fernández-Delgado et al. (2014) found them to be the top performing classifier when compared across 100’s of alternatives across 100’s of datasets. Though Breiman once famously crowned Adaboost (Freund et al., 1996) the best “off-the-shelf classifier in the world”, random forests have since taken that spot in the minds of many.

The sustained success random forests has led naturally to the desire to better understand the statistical and mathematical properties of the procedure. Lin and Jeon (2006) introduced the potential nearest neighbor framework and Biau and Devroye (2010) later established related consistency properties. Biau et al. (2008), Biau (2012), and Klusowski (2019b) analyzed convergence properties and rates for idealized versions of the procedure. Methods for obtaining estimates of the standard errors of random forest predictions were developed in Sexton and Laake (2009) and Wager et al. (2014) while algorithmic convergence studies determining how many trees are necessary for stabilization were provided by Lopes et al. (2019a) and Lopes et al. (2019b). The out-of-bagg variable importance measures originally suggested by Breiman (2001) have since been analyzed thoroughly and shown to be biased towards correlated features and those with many categories (Strobl et al., 2007, 2008; Toloşi and Lengauer, 2011; Nicodemus et al., 2010). Hooker and Mentch (2019) recently provided an explanation for this kind of behavior. The core random forest methodology has been extended to numerous frameworks including quantile regression (Meinshausen, 2006), reinforcement learning (Zhu et al., 2015), and survival analysis (Hothorn et al., 2005; Ishwaran et al., 2008; Cui et al., 2017; Steingrimsson et al., 2019). For a more comprehensive overview of general random-forest-related research, we refer the reader to a recent review paper from Biau and Scornet (2016).

In the last several years, a number of important statistical properties of random forests have also been established whenever base learners are constructed with subsamples rather than bootstrap samples. Scornet et al. (2015) provided the first consistency result for Breiman’s original random forest algorithm whenever the true underlying regression function is assumed to be additive. Mentch and Hooker (2016) provided the first distributional results for random forests, demonstrating that predictions are asymptotically normal under various regularity conditions and proposing accompanying procedures for producing confidence intervals and hypothesis tests for feature importance. Mentch and Hooker (2017) extended this idea to grid-structured testing to allow for hypothesis tests for additivity in the regression function. Coleman et al. (2019) proposed a more computationally efficient permutation-based hypothesis test allowing such procedures to scale easily to big data settings where large test sets may be utilized. Wager and Athey (2018) established both consistency and asymptotic normality for honest and regular trees whenever many trees are constructed. Peng et al. (2019) weakened the conditions necessary for asymptotic normality in a more general setting and provided Berry-Esseen Theorems describing their rate of convergence.

Despite the impressive volume of research from the past two decades and the exciting recent progress in establishing their statistical properties, a satisfying explanation for the sustained empirical success of random forests has yet to be provided. In their recent review, Biau and Scornet (2016) note that research investigating the properties of random forest tuning parameters is “unfortunately rare” and that “present results are insufficient to explain in full generality the remarkable behavior of random forests.” Indeed, while numerous studies have experimented with various tuning parameter setups and strategies (Genuer et al., 2008; Bernard et al., 2009; Genuer et al., 2010; Duroux and Scornet, 2016; Scornet, 2017; Probst and Boulesteix, 2017; Probst et al., 2019), results from these works have largely been limited to high-level, somewhat heuristic takeaways with most concluding only that building more trees helps to stabilize predictions and that the other tuning parameters can have a substantial effect on accuracy. Other work has attempted to garner insight by drawing connections between tree-based ensembles and other related procedures such as kernel methods (Scornet, 2016; Olson and Wyner, 2018) or neural networks (Welbl, 2014; Biau et al., 2016). However, to our knowledge, besides the original motivation provided by Breiman (2001), the only other work devoted to attempting to explain the success of random forests was provided recently by Wyner et al. (2017) who hypothesized that their strong performance was due to their behavior as “self-averaging interpolators.” Belkin et al. (2019) very recently wrote in support of this notion. However, as expanded upon in detail in the following section, in our view, these existing explanations fall short on a number of fronts. In general, the arguments provided apply equally well to other learning methods not generally thought to have strong and robust empirical performance, fail to provide insight into the role played by the randomness, and apply only to particular problem setups.

In contrast, the work presented here offers a concrete explanation for the role played by the randomness most commonly injected into random forest procedures. Instead of assuming that random forests simply do “work well”, we take a more principled approach in trying to isolate the effects of that randomness and determine when its inclusion results in improved accuracy over a baseline approach like bagging. In particular, we argue that the randomness serves to regularize the procedure, making it highly advantageous in low signal-to-noise ratio settings. To drive home this point, we further demonstrate that incorporating similar randomness into alternative (non tree-based) model selection procedures can result in improved predictive accuracy over existing methods in exactly the settings where such improvements would be expected. This approach is inspired largely by recent work on degrees of freedom for model selection by Tibshirani (2015) and Hastie et al. (2017).

The remainder of this paper is laid out as follows. In Section 2 we formalize the random forest procedure and continue the above discussion, providing a more detailed overview of the shortcomings of existing explanations for their success. In Section 3 we discuss degrees of freedom for model selection procedures and demonstrate that within a traditional random forest context, more randomness results in procedures with fewer degrees of freedom. We emphasize and build upon this finding in Section 4 by demonstrating in numerous settings using both real and synthetic data that the relative improvement in accuracy offered by random forests appears directly related to the relative amount of signal contained within the data. Finally, in Section 5 we develop linear model forward-selection-style analogues for both bagging and random forests and produce near identical results, finding in particular that in highly noisy low-dimensional settings, injecting randomness into the selection procedure can outperform even highly competitive explicit regularization methods such as the lasso.

2 Random Forests and Existing Explanations

We begin by formalizing the random forest framework in which we will work in the following sections. Unless otherwise noted, throughout the remainder of this paper we will consider a general regression framework in which we observe (training) data of the form 𝒟n={Z1,,Zn} where each Zi=(𝑿i,Yi), 𝑿i=(X1,i,,Xp,i) denotes a vector of p features, Y denotes the response, and the variables have a general relationship of the form

Y=f(X)+ϵ (1)

where ϵ is often assumed to be independent noise with mean 0 and variance σ2. For a given point z=(x,y), a random forest prediction at x takes the form

y^=RF(x;𝒟n,Θ)=1Bi=1BT(x;𝒟n,Θi) (2)

where the base-learners T are typically tree-based models constructed on some resample of the original data 𝒟n. In the work that follows we assume these base models are regression trees built according to the standard CART criterion (Breiman et al., 1984) whereby an internal node t is split in an axis-aligned fashion into left and right daughter nodes of the form tL={𝒙t:xjs} and tR={𝒙t:xj>s} whenever the decision is made to split the feature Xj at s. The particular variable and split location are chosen from among those available as that pair which minimizes the resulting within-node variance of the offspring. For a more detailed discussion of this procedure, we refer the reader to recent work by Scornet et al. (2015) or Klusowski (2019a).

Throughout the literature on random forests, it is common to succinctly contain all randomness in the single term Θi as in equation (2) above. We note however that for our purposes below, it may be convenient to consider this more explicitly as Θi=(Θ𝒟,i,Θmtry,i). Written in this form, Θ𝒟,i serves to select the resample of the original data utilized in the ith tree. While much recent work has focused on subsampled random forests, here we consider the B resamples to be bootstrap samples as originally put forth in Breiman (2001). The second randomization component Θmtry,i then determines the eligible features to be split at each node in the ith tree. As is standard, we assume that at each internal node, the split must be be chosen from among only 𝚖𝚝𝚛𝚢p candidate features and that such candidates are selected uniformly at random without replacement. When 𝚖𝚝𝚛𝚢=p, the procedure reduces to bagging (Breiman, 1996).

2.1 Explanations for Random Forest Success

The original reasoning behind random forests provided by Breiman (2001) was based on an extension of the randomized tree analysis given in Amit and Geman (1997). Breiman showed that the accuracy of any randomized ensemble depends on two components: the strength (accuracy) of the individual base-learners and the amount of dependence between them. Thus, the original motivation for a procedure like random forests might be seen from a statistical perspective as akin to the classic bias-variance tradeoff. In the same way that some procedures (e.g. the lasso (Tibshirani, 1996; Chen et al., 2001)) consider sacrificing a small amount of bias in exchange for a large reduction in variance, random forest ensembles might be seen as sacrificing a small amount of accuracy at the base-learner level (by injecting the extra randomness) for a large reduction in between-tree correlation. Hastie et al. (2009) provide a thorough, high-level discussion of this effect in showing that the mtry parameter serves to reduce the variance of the ensemble.

However, this discussion from Breiman (2001) might be better seen as motivation for why a randomized ensemble could potentially improve accuracy rather than an explanation as to why random forests in particular do seem to work well. Breiman himself experiments with different kinds of randomness in the original manuscript and suggests that in practice users can also experiment with different forms to try and determine what works best in particular settings. Furthermore, in his concluding remarks, Breiman notes that while the additional randomness at the base learner level helps to reduce the variance of the ensemble, the magnitude of improvement often seen with random forests suggested to him that perhaps it somehow also “act[s] to reduce bias” but that ultimately “the mechanism for this [was] not obvious.” In the years since, it has been shown quite clearly that the benefits sometimes seen with random forests are the result of variance reduction alone; see Hastie et al. (2009) for a more complete discussion.

In recent work, Wyner et al. (2017) take a more definitive stance, conjecturing that both random forests and AdaBoost (Freund et al., 1996) work well because both procedures are “self-averaging interpolators” that fit the training data perfectly while retaining some degree of smoothness due to the averaging. The key to their success, they argue, is that in practice, datasets often contain only small amounts of noise and these algorithms are able to mitigate the effects of noisy data points by localizing their effect so as to not disturb the larger regions where the data consists mostly of signal. Indeed, the authors acknowledge that the procedures “do in fact overfit the noise – but only the noise. They do not allow the overfit to metastasize to modestly larger neighborhoods around the errors.”

2.2 Random Forests and Interpolation

As the central claim of Wyner et al. (2017) is that random forests “work not in spite, but because of interpolation” we now make this notion and argument explicit.

Definition 1 (Interpolation).

A classifier (or regressor) f^ is said to be an interpolating classifier (regressor) if for every training point (𝐱j,yj)Dn, f^(𝐱j)=yj.

This definition of an interpolating classifier is taken directly from Wyner et al. (2017); for completeness and because it will be directly relevant to the immediate conversation, we expand the definition so as to apply in the same fashion to regression contexts.

Consider first the classification setting wherein the response Y{a1,,ak} and we seek to utilize the training data 𝒟n to construct a classifier f^n:𝒳{a1,,ak}. The random forest procedure laid out above for regression can be easily amended to perform classification – indeed, the original procedure provided in Breiman (2001) was discussed largely for classification problems. Here, split locations within trees are chosen as those which provide the greatest empirical reduction in Gini impurity and final predictions are taken as the majority vote – the class predicted most often – across all trees in the forest.

Figure 1: (Left): Interpolation probability vs. number of trees (B) for a single observation. (Right): Interpolation probabilities (upper bounds) vs. sample sizes for random forest classifiers built with different numbers of trees.

Now consider a particular observation z=(x,y)𝒟n. In order to be more explicit and without loss of generality, suppose that y=a1. Given B resamples of the original data 𝒟1*,,𝒟B*, whenever trees are fully grown so that each terminal node contains only a single observation, it must necessarily be the case that T(x;𝒟i*,Θi)=a1 (i.e. the tree predicts the correct class a1 for x) whenever (x,y)𝒟i*. Thus, in order for the random forest classifier to select the correct class for x, (i.e. RF(x)=a1) it suffices to ensure that (x,y) is selected in a plurality (or simple majority in the case of binary classification) of the resamples. When bootstrap samples are used, it is well known that the probability of each observation appearing is approximately 0.632. Thus, given B bootstrap samples, the probability that the observation (x,y) appears in at least half of these resamples is approximately

pint(B)=1-Bin(B/2;n=B,p=0.632)

where Bin(z;n,p) denotes the binomial cdf evaluated at the point z with parameters n (the number of trials) and p (the probability of success in each trial). For even moderately large B, this probability is quite large; see the left plot in Figure 1. Thus, for binary classification problems, the probability of interpolating any given training observation is large whenever B is also moderately large.

Note however that according to the definition above, a classifier is only designated as an interpolator if it interpolates all training observations. While an exact calculation for the probability of all n points appearing in at least half of the bootstrap samples is somewhat involved, (pint(B))n serves as a simple upper bound. Plots of these upper bounds on the interpolation probability are shown in the right plot in Figure 1 across a range of sample sizes for B=100,150,and 200. In each case, for fixed B, the probability that the classifier is an interpolator tends to 0 as n and thus, in order for random forest classifiers to necessarily interpolate, the number of bootstrap replicates B must be treated as an increasing function of n. Though perhaps obvious, we stress that this is not generally the manner in which such estimators are constructed. In all software with which we are familiar, default values of B are set to a fixed moderate size, usually on the order of a few hundred, independent of the size of the training set. Recent work from Lopes et al. (2019b) has also provided a means for estimating the algorithmic variance of random forest classifiers and shown that such variance sometimes vanishes quite quickly after relatively few bootstrap samples. Thus, while it’s possible to construct random forests in such a way that they necessarily interpolate with high probability in classification settings, it is not clear that random forests would generally be constructed in this fashion in practice and thus it is also not clear that the interpolation-based explanation offered by Wyner et al. (2017) is sufficient to explain the strong performance of random forests, even in specific contexts. It is also worth noting that on certain datasets where random forests happen to produce good models with low generalization error, they may likely also fit quite well on the training data, perhaps even nearly interpolating. This, however, is certainly possible for any modeling procedure and thus in no way would aid in explaining the particular success of random forests.

2.3 Shortcomings of Current Explanations

Before continuing with our critique, it’s worth pausing to note where these existing explanations are in agreement. Both Breiman (2001) and Wyner et al. (2017) seem to largely agree on the following points:

  1. 1.

    Random forests and boosting behave in a very similar fashion and thus their success should be able to be explained in a very similar fashion [Breiman (2001) pages 6, 20, Section 7; Wyner et al. (2017) entire paper].

  2. 2.

    Random forests and boosting generally outperform most other competing methods (e.g. Dietterich (2000) and Breiman (2000)) in terms of minimizing generalization error and substantially outperform bagging. [Breiman (2001) page 10; Wyner et al. (2017) page 3].

  3. 3.

    Random forests generally seem to be robust to outliers and noise [Breiman (2001) pages 10, 21; Wyner et al. (2017) pages 4, 12, 17, 20, 32, 35].

  4. 4.

    Boosting tends to perform well and not overfit even when the ensemble consists of many deep trees [Breiman (2001) page 21; Wyner et al. (2017) page 8].

Points 1 and 4 are largely irrelevant to the discussion in the remainder of this paper as we focus exclusively on random forests; we include these points here only in the interest of completeness. Point 3 has been alluded to in numerous papers throughout the years and has likely been key to the sustained popularity of the random forest procedure.

We take slight issue with the now popular wisdom in point 2, that random forests simply “are better” than bagging or other similar randomized approaches. While this does seem to be the case surprisingly often in practice on real-world datasets (see, for example, the recent large-scale comparison form Fernández-Delgado et al. (2014) discussed in the introduction) it is certainly not a universal truth and in our view, is a potentially naive foundation on which to build a theory for explaining their success. As discussed above, Breiman (2001) does provide some motivation for why a randomized ensemble might sometimes outperform its nonrandomized counterpart in showing that the generalization error of a classifier can be bounded above by a function of base-learner accuracy and correlation. Breiman stops short, however, of providing any more explicit explanation for the role played by the randomness or in what situations that randomness might be expected to help the most. Wyner et al. (2017), on the other hand, seem to largely ignore the role of randomness altogether. The explanation the authors provide for random forest success would seem to apply equally well to bagging. In the sections below, we focus our attention heavily on determining when the inclusion of such randomness provides the greatest benefit and provide an explicit characterization of the role it plays.

It is also important to stress that the theories offered by Breiman (2001) and Wyner et al. (2017) pertain only to the classification setting, whereas our focus is primarily on regression. The interpolation hypothesis put forth by Wyner et al. depends on an even stricter setup whereby trees are built to full depth, bootstrapping (or at least subsampling without replacement at a rate of at least 0.5n) is used to generate resamples, and the number of trees B grows with n at a sufficiently fast rate. Previous work, however, has repeatedly shown that random forests can still achieve a high degree of predictive accuracy when trees are not built to full depth (Duroux and Scornet, 2016), and/or are constructed via subsampling (Zaman and Hirose, 2009; Mentch and Hooker, 2016; Wager and Athey, 2018), and/or when relatively few trees are built (Lopes et al., 2019b).

Figure 2: Left: Two interpolating regressors on toy data. Right: An interpolating regressor and the predicted regression function resulting from bagging with 100 fully-grown trees.

To see why random forests cannot be considered interpolators in a regression setting, even when individual trees are built to full depth, note that the final regression estimate is taken as the average of predictions across all trees rather than the majority vote. While it’s certainly clear that an average of interpolators is itself an interpolator since for every training point (x,y)

1Bi=1Bf^i(x)=1Bi=1By=y,

the bootstrapping mechanism in play with random forests precludes the possibility of interpolating on 𝒟n with exceedingly high probability. As noted above, each bootstrap sample will omit, on average, 36.8% of the original observations and thus individual trees, even if fully grown, will not, in general, fit perfectly on that out-of-sample (out-of-bagg) data. In other words, while a fully-grown tree will necessarily interpolate the observations selected within its corresponding bootstrap sample, it will generally not fit perfectly to all training observations. For a given point (x,y) in the training data, random forest regression estimates at x are therefore a weighted average of y and the other response values observed, and hence with exceedingly high probability, the random forest will not interpolate.

Figure 2 demonstrates this effect clearly. The left panel shows two generic interpolating functions – Interpolator A and Interpolator B – on a simple toy dataset while the right panel shows one of the same interpolators along with the predicted regression function resulting from bagging with 100 regression trees, each grown to full depth. In this toy example (motivated by the examples shown in Section 3.2 of Wyner et al. (2017)), our training data consists of 11 points each of the form (i,0) for i=1,,11 except for i=6 where we instead have the observation (6,1). Denote the location of this point by x*. Wyner et al. (2017) contend that if the observed response is considered “unusually noisy” at x*, then interpolating estimators can perform well by “localizing” the effect of this noise as is seen in the left panel of Figure 2. Indeed, we can see from this plot that both interpolators still fit the remaining data perfectly well despite the presence of the noisy observation. However, as can be seen in the right-hand panel, whenever we treat this as a regression problem and build trees to full depth, the random forest (in this case, simple bagging since we have only 1 feature) does not interpolate, but instead looks to be attempting to smooth-out the effect of the outlying observation.

This, however, stands in opposition to the reasoning provided in Wyner et al. (2017). Here the authors are highly critical of the traditional statistical notion of signal and noise and seem to take some issue with the general regression setup given in (1). To computer scientists, they claim, in many problems “there is no noise in the classical sense. Instead there are only complex signals. There are residuals, but these do not represent irreducible random errors.” But if the widespread empirical success of random forests is really “not in spite, but because of interpolation” as claimed, then one must believe that real-world data is generally low-noise, a claim argued against firmly by, for example, Hastie et al. (2017). Crucially, this means that not only are data largely free of what scientists may think of as classical kinds of noise like measurement error, but also that all sources of variation in the response Y can be explained almost entirely by the available set of predictor variables X1,,Xp.

Perhaps most importantly, if the success of random forests is the result of interpolation and interpolation is beneficial because real-world data has very high signal-to-noise ratios (SNRs), then random forests ought not perform well at all on datasets with low SNRs. Consider again the plot on the left-hand-side of Figure 2. If the outlying point x* at (6,1) is actually the only “good signal” while the rest of the data are noisy, then the interpolators shown would be isolating signal rather than noise and hence be performing quite poorly.

But this is exactly the opposite of what we see with random forests in practice. In the following sections, we show repeatedly on both real and synthetic data that relative to non-randomized procedures like bagging, the benefit of the additional randomness is most apparent in low SNR settings. In Section 3 we show that the mtry parameter has a direct effect on the number of degrees of freedom (dof) associated with the procedure, with low values of mtry (i.e. more randomness) corresponding to the least flexible model forms with the fewest dof. Given this, in Section 4 we go on to show that as expected, the advantage offered by random forests is most dramatic at low SNRs, and that this advantage is eventually lost to bagging at high SNRs. We also consider the problem from a slightly different perspective and show that the optimal value of mtry is almost perfectly (positively) correlated with the SNR. We posit that this behavior is due to a regularizing effect caused by the randomness and bolster this claim by demonstrating the same, relatively surprising results hold in simpler linear model setups where randomness is injected into a forward selection process.

3 Random Forests and Degrees of Freedom

Recall from the previous section that we assume data of the form 𝒟n={Z1,,Zn} where each Zi=(𝑿i,Yi), 𝑿i=(X1,i,,Xp,i) denotes a vector of p features, Y denotes the response, and the variables have a general relationship of the form Y=f(X)+ϵ. Assume further that the errors ϵ1,,ϵn are uncorrelated with mean 0 and (common) variance σ2. Given a particular regression estimate f^ that produces fitted values y^1,,y^n, the degrees of freedom (Efron, 1986; Efron and Tibshirani, 1990; Tibshirani, 2015) of f^ is defined as

df(f^)=1σ2i=1nCov(y^i,yi). (3)

The degrees of freedom (dof) of a particular estimator is generally understood as a measure of its flexibility; estimators with high dof depend more heavily on the particular values observed in the original data and hence are higher variance. Understanding the dof associated with various estimation procedures can provide valuable insights into their behavior as well as the situations when they might be expected to perform better or worse relative to a set of alternative methods. Tibshirani (2015) took an important step in this regard, showing that adaptive procedures like best subset selection (BSS) and forward stepwise selection (FS), even when selecting a model with k terms, had more than k dof because of the increased dependence on the data incurred through the selection aspect. These additional dof were coined the search degrees of freedom. More recently, Hastie et al. (2017) provided a thorough collection of simulations to demonstrate the predictive advantages of regularized procedures like the lasso and relaxed lasso over BSS and FS, especially in low signal-to-noise ratio (SNR) settings where the SNR is defined as

SNR=Var(f(x))Var(ϵ).

Much of the work in the following sections was inspired by the approach taken in Hastie et al. (2017) and various portions of the work below follow closely to the setups considered there.

We begin our work by estimating the dof of random forests under various values of mtry. In linear model contexts, the dof for different estimators is generally shown by plotting the estimated dof against the average number of nonzero coefficients in the selected models. In our context with tree-based estimators, we use maxnodes – the maximum number of terminal nodes that any tree within the forest can have – as an analogue. For any given forest with fixed value of mtry, we should expect that as trees are allowed to grow deeper (i.e. maxnodes takes larger values), the estimators should become more sensitive to the data and hence incur higher dof.

We consider two general model forms: a linear model

Y=Xβ+ϵ=X1β1++Xpβp+ϵ

and the model

Y=0.1e4X1+41+e-20(X2-0.5)+3X3+2X4+X5+ϵ,

which we refer to as ‘MARSadd’ as it is additive and first appeared in the work on Multivariate Adaptive Regression Splines (MARS) by Friedman (1991). Features in the MARSadd model are sampled independently from Unif(0,1). For the linear model, in line with Hastie et al. (2017), we consider three different settings:

  • Low: n=100, p=10, s=5

  • Medium: n=500, p=100, s=5

  • High-10: n=100, p=1000, s=10

where n is the total (training) sample size, p denotes the total number of features generated, and s<p is the number of features with a nonzero coefficient thus considered signal. Rows of Xn×p are independently drawn from N(0,Σ), where Σp×p has entry (i,j) = ρ|i-j|. We take ρ=0.35 and set the first s components of β equal to 1 with the rest set to 0. This setup corresponds to the general sampling scheme and ‘beta-type 2’ setting from Hastie et al. (2017). For both models here as well as throughout the majority of the remainder of this paper, we consider sampling the noise as ϵN(0,σ2I) where σ2 is chosen to produce a corresponding SNR level ν, so that, for example, in the linear model case, we take

σ2=βTΣβν.

Finally, in most previous literature, 𝚖𝚝𝚛𝚢p+ denotes the number of features eligible for splitting at each node. Here and throughout the remainder of the paper, we adopt a slightly different (but equivalent) convention by defining mtry as the proportion of eligible features so that 𝚖𝚝𝚛𝚢(0,1]. This is purely for readability and ease of interpretation so that readers may more immediately see whether the number of available features is large or small (relative to p) regardless of the particular model setup being considered.

Figure 3: Degrees of freedom for random forests at different levels of mtry.

Results are shown in Figure 3. In each of the four model setups we take the SNR to be equal to 3.52 and estimate the dof for random forests with mtry equal to 0.1, 0.33, 0.67, and 1. Note that when 𝚖𝚝𝚛𝚢=1 the model reduces to bagging (Breiman, 1996) and 𝚖𝚝𝚛𝚢=0.33 corresponds to the standard choice of p/3 eligible features at each node in regression settings, as is the default in most software. The forests are constructed using the randomForest package in R (Liaw et al., 2002) with the default settings for all arguments except for mtry and maxnodes. Each point in each plot in Figure 3 corresponds to a Monte Carlo estimate of the dof formula given in (3) evaluated over 500 iterations.

Several clear patterns are apparent in Figure 3. First and perhaps most obviously, as conjectured above, in each case we see that the dof increases as maxnodes increases and trees can be grown to a greater depth. Each plot also shows the same general concave increasing shape for each forest. Furthermore, the estimated dof function for each forest lies above the diagonal (shown as a dotted line in each plot), supporting the general notion formalized in Tibshirani (2015) that adaptive procedures like the tree-based models employed here incur additional dof as a result of this search.

More importantly for our purposes, in each plot in Figure 3 we see that at every fixed level of maxnodes, the dof increases with mtry. In particular, bagging (𝚖𝚝𝚛𝚢=1) always contains more dof than the default implementation for random forest regression (𝚖𝚝𝚛𝚢=0.33). Finally, we note that the patterns seen in these plots also hold for numerous other regression functions and SNRs that were experimented with; Figure 9 in Appendix A shows the results of the same experiments above carried out at a much lower SNR of 0.09 and the findings are nearly identical.

4 Random Forest Performance vs Signal-to-Noise Ratio

The empirical results in the preceding section suggest that the mtry parameter in random forests is directly tied to its dof with larger values resulting in estimators with higher dof and more flexibility. Based on this intuition and the results for linear estimators provided in Hastie et al. (2017), we should therefore expect that random forests with smaller values of mtry to perform well in noisy settings while bagging should perform best – potentially even better than random forests – at high SNRs. We now investigate this more formally.

4.1 Relative Performance on Synthetic Data

We begin by comparing the relative performance of random forests and bagging on simulated data across a range of plausible SNRs. In addition to the linear model setup described in the previous section, we now include the additional regression function

Y=10sin(πx1x2)+20(x3-0.05)2+10x4+5x5+ϵ

which we refer to as ‘MARS’ because like the additive model used previously, it first appeared in the MARS paper (Friedman, 1991), though note that unlike the previous model, it contains explicit interactions between the features. This particular MARS model has proven popular in random forest publications in recent years, appearing, for example, in Biau (2012) and Mentch and Hooker (2016). In the simulation setups described below, we consider the medium setting for the linear model (n=500, p=100, s=5) and for the MARS model, we take p=s=5 and consider sample sizes of n=200,500, and 10000. Features for the linear model are generated in the same fashion as above and those in the MARS model are sampled independently from Unif(0,1).

As in the previous section, the variance σ2 of the noise term is chosen so as to induce particular SNRs. Here, following Hastie et al. (2017), we consider 10 SNR values ν=0.05,0.09,0.14,,6.00 equally spaced between 0.05 and 6 on the log scale. Forests are again constructed using the randomForest package with the default settings except in the case of bagging where the default value of mtry is changed so that all features are available at each split.

Here, in comparing the performance of “random forests” against “bagging”, we stress that we are merely assessing the difference in predictive accuracies between forests constructed with 𝚖𝚝𝚛𝚢=0.33 (traditional random forests) versus those built with 𝚖𝚝𝚛𝚢=1 (bagging). To compare the relative performance for a fixed model setup with fixed SNR, we first generate a training dataset and then evaluate the mean squared error (MSE) for both bagging and random forests on an independently generated test dataset consisting of 1000 observations. This entire process is then repeated 500 times for each setting and we record the average difference in accuracy (Error(Bagg)-Error(RF)) across these repititions.

Results are shown in Figure 4. Each plot shows the average difference in test errors versus the SNR; note that positive differences indicate that random forests (𝚖𝚝𝚛𝚢=0.33) are outperforming bagging (𝚖𝚝𝚛𝚢=1). In each of the four regression setups shown in Figure 4, the improvement in accuracy seen with random forests decreases as the SNR increases. For large SNR values, bagging eventually begins to outperform the traditional random forests. Note also that for the MARS function, random forests seem to retain their relative improvement longer (i.e. for larger SNRs) with larger training samples. Given these results, the conventional wisdom that random forests simply “are better” than bagging seems largely unfounded; rather, the optimal value of mtry seems to be a function of the SNR.

Figure 4: Differences in test errors between bagging and random forests. Positive values indicate better performance by random forests.

4.2 Optimal mtry vs SNR

The results above indicate that random forests (𝚖𝚝𝚛𝚢=0.33) generally seem to outperform bagging (𝚖𝚝𝚛𝚢=1) unless the SNR is large. We now reverse the direction of this investigation and estimate the optimal value of mtry across various SNR levels.

Here, as above, we consider both the MARS model and a linear model. In the same fashion as in previous simulations, the errors are sampled from a N(0,σ2) where σ2 is chosen to produce a particular SNR and we consider the same 10 SNR values as above. For the MARS model, we take p=s=5 and generate features independently from Unif(0,1) and for the linear model, we take p=20 and s=10 with features drawn from Np(0,Σ) where the (i,j) entry of Σ is given by ρ|i-j| with ρ=0.35. The first s=10 coefficients in the linear model are set equal to 1 with the rest set equal to 0.

For both models, we consider (training) sample sizes of both n=50 and n=500 and generate an independent test set of the same size. We construct forests using all possible values of mtry with the remaining options at the default settings in randomForest. The entire process is repeated 500 times and the mtry value corresponding to the forest with the lowest average test error for each setting is selected. The results are shown in Figure 5. As expected, corroborating the findings above, the optimal value of mtry increases with the SNR and the same general pattern emerges for both models and sample sizes. Figure 10 in Appendix A shows a slightly different calculation where the optimal mtry value on each of the 500 iterations is determined and the overall mean is then calculated. Here too we see exactly the same general pattern in that as the SNR increases, so does the optimal value of mtry.

Figure 5: Optimal value of mtry vs SNR for the MARS and linear model.

4.3 Relative Performance on Real Data

The work above presents strong empirical evidence that the relative improvement in predictive accuracy seen with random forests is largest at low SNRs and more generally, that the optimal value of mtry appears to be a direct (increasing) function of the SNR. These results, however, pertain only to those particular simulation settings that some may argue are highly idealized. Real-world data may contain far more complex relationships and thus we now explore whether the same general findings above also appear in more natural contexts.

Dataset p n
Abalone Age [abalone] (Waugh, 1995) 8 4177
Bike Sharing [bike] (Fanaee-T and Gama, 2014) 11 731
Bioston Housing [boston] (Harrison Jr and Rubinfeld, 1978) 13 506
Concrete Compressive Strength [concrete] (Yeh, 1998) 8 1030
CPU Performance [cpu] (Ein-Dor and Feldmesser, 1987) 7 209
Conventional and Social Movie [csm] (Ahmed et al., 2015) 10 187
Facebook Metrics [fb] (Moro et al., 2016) 7 499
Parkinsons Telemonitoring [parkinsons] (Tsanas et al., 2009) 20 5875
Servo System [servo] (Quinlan, 1993) 4 167
Solar Flare [solar] (Li et al., 2000) 10 1066
Aquatic Toxicity [AquaticTox] (He and Jurs, 2005) 468 322
Molecular Descriptor Influencing Melting Point [mtp2] (Bergström et al., 2003) 1142 274
Weighted Holistic Invariant Molecular Descriptor [pah] (Todeschini et al., 1995) 112 80
Adrenergic Blocking Potencies [phen] (Cammarata, 1972) 110 22
PDGFR Inhibitor [pdgfr] (Guha and Jurs, 2004) 320 79
Table 1: Summary of real-world data utilized. For datasets where no reference was specified, a reference to early work utilizing the data is given.

To investigate this, we utilize 10 datasets intended for regression from the UCI Machine Learning Repository (Dua and Graff, 2017). Because most of these datasets are low-dimensional, five additional high-dimensional datasets were also included, four of which were downloaded from openml.org (Vanschoren et al., 2013) with the other (AquaticTox) taken from the R package QSARdata. Summaries of these datasets are provided in Table 1. For datasets containing missing values (csm and fb), the corresponding rows of data were removed. Here we do not know the true SNR and thus to compare the relative performance of bagging and random forests, we inject additional random noise ϵ into the response, where each ϵN(0,σ2) and σ2 is chosen as some proportion α of the variance of the original response variable. We consider α=0,0.01,0.05,0.1,0.25 and 0.5 where α=0 corresponds to the case where no additional noise is added and performance is thus compared on the original data. To compare performance, we measure the relative test error (RTE)

RTE =Err^(Bagg)-Err^(RF)σ^y2×100% (4)

where Err^(Bagg) and Err^(RF) denote the 10-fold cross-validation error on bagging and random forests, respectively, and σ^y2 is the empirical variance of the original response. For each setting on each dataset, the process of adding additional random noise is replicated 500 times and the results are averaged. Once again, forests are constructed using the randomForest package with the default settings except for fixing 𝚖𝚝𝚛𝚢=1 for bagging and 𝚖𝚝𝚛𝚢=0.33 for random forests.

Results are shown in Figure 6 with low-dimensional datasets shown in the left plot and high-dimensional datasets shown on the right. Note that to aid in presentation, these display the shifted RTE rather than the raw calculation in (4). For a given proportion of additional noise α, let RTE(α) denote the corresponding relative test error. The shifted RTE at noise level α is then defined as RTE(α) - RTE(0). This ensures that the relative error for each dataset begins at the origin thereby allowing us to present all results in a single easy-to-interpret plot.

In both plots in Figure 6, the same general pattern appears as has been seen in the subsections above: as we increase the amount of additional noise inserted into the models, the relative improvement in predictive accuracy seen with random forests becomes more pronounced. The only slight exceptions are seen on the low-dimensional csm dataset and the high-dimensional pah dataset were the relative error appears to decrease by a small amount before eventually coming back up when large amounts of noise are added. It is not clear why the initial temporary drops occur in these two datasets, though it is worth noting that even the largest magnitudes of drops are quite small at approximately 0.2% and 0.13% in the csm and pah datasets, respectively.

Figure 6: Shifted RTE on real data where additional noise is added. The left plot shows results on low-dimensional datasets taken from the UCI repository; the right plot shows results on high-dimensional datasets.

5 Randomized Forward Selection

The results from the previous sections suggest that the optimal value of mtry for random forests is highly data-dependent, with smaller values preferred in noisy settings and bagging (𝚖𝚝𝚛𝚢=1) being preferred when very little noise is present. These findings are much in line with the classic understanding of random forests as a means by which the variance of the ensemble is reduced as a by-product of reducing the correlation between trees. The benefits of this variance reduction are most apparent at low SNRs.

In our view, however, this remains only a partial explanation. While randomizing the collection of features eligible for splitting at each node is one way to reduce between-tree correlation, it is certainly not the only means by which this can be accomplished. Breiman (2001) experimented with alternative implementations finding that even very naive approaches like adding random noise to the outputs of each tree could sometimes be beneficial. But as discussed in the opening sections, Breiman (2001), like many others after, also noted that the particular approach where features are randomly precluded from splitting at each node seemed to produce substantially more accurate predictions than other strategies for reducing between-tree correlation. Why this was the case, however, was not clear and has remained a subject of speculation in the nearly two decades following the inception of the procedure.

As already briefly mentioned above, Hastie et al. (2017) recently provided an extended comparison of several variable selection procedures including the lasso (Tibshirani, 1996; Chen et al., 2001), relaxed lasso (Meinshausen, 2007), forward stepwise selection (FS), and best subset selection (BSS). The relaxed lasso estimator utilized in Hastie et al. (2017) takes the form

β^relax(λ,γ)=γβ^lasso(λ)+(1-γ)β^LS|lasso(λ)

where β^LS|lasso(λ) denotes the vector of coefficient estimates obtained via least squares (LS) when computed on only those variables selected via the lasso and filled in with 0 for variables not selected. Perhaps the most striking takeaway from their study is that in low-dimensional settings where n>p, the more aggressive procedures (FS and BSS) with higher dof are generally not competitive with a regularized approach like the lasso at low SNRs but eventually produce more accurate predictions when the SNR becomes large. Relaxed lasso, taking the weighted average of lasso and LS-after-lasso estimates, possesses the ability to effectively trade-off the amount of regularization needed depending on the SNR and always seems to perform well.

For these kinds of estimators whose inner-workings are better understood, the reasoning behind the results observed is relatively straightforward. In low SNR settings, procedures like the lasso and relaxed lasso that explicitly regularize the problem can prevent overfitting to the noise by applying shrinkage to the coefficient estimates of the selected features. Given that we see the same general pattern here – random forests (𝚖𝚝𝚛𝚢=0.33) outperforming bagging (𝚖𝚝𝚛𝚢=1) except at high SNRs – it is reasonable to suspect that the additional randomness in random forests is playing a similar regularization role. Indeed, by randomly not allowing certain features to be split, random forests may be seen as effectively shrinking the potential influence of features, with the amount of shrinking being proportional to the amount of additional randomness added (with smaller values of mtry inducing more randomness). Speculation to this effect is described in Hastie et al. (2009).

Importantly however, if this is the kind of underlying effect that allows random forests to perform well in practice, such an effect should not be limited to tree-based ensembles. Indeed, if regression trees are seen as merely a complex form of forward selection, then if we were to create bagged and randomized versions of a standard forward selection procedure – analogues to the traditional tree-based versions of bagging and random forests – we should expect to see the same general patterns of relative improvement. In the following sections, we explore this idea and confirm that not only does a similar pattern emerge, but these ensemble-ized extensions of standard forward selection procedures exhibit surprisingly strong performance relative to alternative procedures.

5.1 Degrees of Freedom for Randomized Forward Selection

{algorithm}

[t] Randomized Forward Selection (RandFS) {algorithmic} \ProcedureRandFS𝒟n,B,d,𝚖𝚝𝚛𝚢 \Forb=1,,B \StateDraw bootstrap sample 𝒟(b)={(𝑿i(b),Yi(b))}i=1n from original data 𝒟n

\State

Initialize empty active set A0={0} \Fork1d \StateSelect subset of 𝚖𝚝𝚛𝚢×p features uniformly at random, Fk \StateSelect jk=argminjFkY(b)-PAk-1{jk}Y(b)22

\State

Update active set Ak=Ak-1{jk} \StateUpdate coefficient estimates β^(b) as

β^Ak(b)=argminβY(b)-𝑿Ak(b)β2,β^Akc(b)=0
\EndFor\EndFor\State

Compute final coefficient estimates β^=1Bb=1Bβ^(b) \StateCompute predictions Y^=Xβ^ \EndProcedure

Algorithm 5.1 formalizes the notion of randomized forward selection (RandFS), which can be seen as a random forest analogue to traditional forward stepwise selection (FS). Note that for any subset of the feature indices 𝒮, we define 𝑿𝒮 as the matrix of feature values whose index is in 𝒮 and P𝒮 as the projection matrix onto the column span of 𝑿𝒮. To carry out RandFS, we begin by drawing B bootstrap samples from the original data and performing forward selection on each. However, like random forests, at each step, only a random selection of 𝚖𝚝𝚛𝚢×p features are eligible for being included in the model. The process continues until the desired model size (depth) d is obtained and the final predictions are taken as an average over the predictions generated by each individual model. When 𝚖𝚝𝚛𝚢=1 so that all features are eligible at each step, we refer to the procedure as bagged forward selection (BaggFS).

We begin by estimating the dof for RandFS as well as for FS, lasso, and relaxed lasso. Here we follow the same initial setup utilized in Hastie et al. (2017) where we assume a linear model Y=Xβ+ϵ with n=70, p=30, and s=5. Rows of X are sampled independently from Np(0,Σ), where the (i,j)th entry of Σ takes the form ρ|i-j| with ρ=0.35 and errors are sampled independently from N(0,σ2) with σ2 chosen to satisfy a particular SNR, in this case 0.7. The first s components of β are set equal to 1 with the rest equal to 0.

The plots in Figure 7 show the estimated dof for the various methods against the number of nonzero coefficient estimates produced. Each point in each plot corresponds to a Monte Carlo estimate of the dof formula given in (3) evaluated over 500 iterations. As expected, the plot on the right shows quite clearly that as with random forests, larger values of mtry produce RandFS procedures with more dof. More surprising is the relative relationship of the RandFS models to the more classical procedures. The plot on the left shows that the dof for BaggFS is almost identical to that of standard FS. The standard mtry value of 0.33 produces a RandFS model with dof very similar to that of the relaxed lasso with γ=0, corresponding to least squares after lasso. Smaller values of mtry appear to offer even more regularization, producing dof similar to relaxed lasso with larger γ values, corresponding to an estimate where more weight is put on the original lasso coefficient estimates.

Figure 7: Estimated dof for forward selection (FS), bagged forward selection (BaggFS), randomized forward selection (RandFS), lasso, and relaxed lasso.

5.2 Relative Performance of Randomized Forward Selection

Building on the intuition from previous sections as well as that provided in Hastie et al. (2017), the dof results in Figure 7 suggest that we should expect to see RandFS models with small values of mtry have a potential advantage in predictive accuracy relative to FS and BaggFS at low SNRs. We now compare the performance of RandFS relative to BaggFS and the more classical procedures.

Here we consider several linear model setups taken directly from Hastie et al. (2017) and similar to those considered in Section 3. We consider four settings:

  • Low: n=100, p=10, s=5

  • Medium: n=500, p=100, s=5

  • High-5: n=50, p=1000, s=5

  • High-10: n=100, p=1000, s=10.

As above, rows of X are independently drawn from N(0,Σ), where Σp×p has entry (i,j) = ρ|i-j| with ρ=0.35 and where we set the first s components of β equal to 1 with the rest set to 0, corresponding to the beta-type 2 setting in Hastie et al. (2017). Noise is once again sampled from N(0,σ2I) where σ2 is chosen to produce a corresponding SNR level ν and we consider the same 10 values ν=0.05,0.09,0.14,,6.00 utilized above.

Tuning of the FS, lasso, and relaxed lasso procedures is done in exactly the same fashion as in Hastie et al. (2017). In all cases, tuning parameters are optimized on an independent validation set of size n. For the low setting, the lasso shrinkage parameter λ follows the default glmnet settings being tuned across 50 values ranging from small to large fractions of λmax=XTY. For relaxed lasso, λ is tuned across the same 50 values and the γ parameter that weights the average of the lasso and LS-after-lasso estimates is chosen from 10 equally spaced values between 0 and 1. The depth of the models in FS, BaggFS, and RandFS are tuned across d=0,1,2,,10. Note that for BaggFS and RandFS, a selected depth of d means that each individual model is built to a depth of d and the final average is then taken; since different individual models will generally select different features, the final averaged model will generally contain more than d features. In addition to considering the default RandFS (𝚖𝚝𝚛𝚢=0.33) and BaggFS (𝚖𝚝𝚛𝚢=1), we also consider tuning the mtry in RandFS across 10 equally spaced values between 0.1 and 1. In the medium, high-5, and high-10 settings, the λ parameter in lasso and relaxed lasso is tuned across 100 values rather than 50 and model depths for FS, BaggFS, and RandFS are tuned across d=0,1,2,,50; all other setups remain the same.

Figure 8: Performance of FS, BaggFS, RandFS, lasso and relaxed lasso across SNR levels for linear models in the low, medium, high-5, and high-10 settings.

To measure performance, we again follow the lead of Hastie et al. (2017) and calculate the test error relative to the Bayes error rate. Specifically, given a test point z0=(x0,y0) with y0=x0Tβ+ϵ0 and ϵ0N(0,σ2), the relative test error (RTE) to Bayes of a regression estimate β^ is given by

RTE(β^)=𝔼(y0-x0Tβ^)2σ2=(β^-β)TΣ(β^-β)+σ2σ2.

Results are shown in Figure 8; the same plots with error bars corresponding to ±1 standard deviation are shown in Appendix A. Each point in each plot corresponds to an average over 100 replications. As expected, the explicit regularizers (lasso and relaxed lasso) perform well in the high-dimensional settings and at low SNRs. In the high-dimensional settings in particular, most methods perform similarly at low SNRs but for larger SNRs, relaxed lasso begins to perform substantially better.

In the low and medium settings, BaggFS largely performs as expected with respect to classical FS. At low SNRs, the variance stabilization offered by BaggFS allows it to outperform FS but that advantage dies out at medium SNRs and both procedures appear similarly optimal relative to the others at high SNRs. It can also be seen in these settings that as originally hypothesized above, RandFS with fixed 𝚖𝚝𝚛𝚢=0.33 outperforms BaggFS at low SNRs but loses the advantage at high SNRs.

The performance of RandFS in these settings with respect to the other procedures, however, is what is perhaps most surprising. Note that in the low setting, the RandFS procedure with tuned mtry outperforms all other methods – including lasso and relaxed lasso – until the mid-range SNR values. In the medium setting, RandFS exhibits a similar property to that of relaxed lasso, performing very well at low SNRs but adapting (likely selecting larger values of mtry) to also perform very well at large SNRs. Even more remarkable is the performance of RandFS when 𝚖𝚝𝚛𝚢=0.33 and is not tuned – in both the low and medium settings, even this default RandFS outperforms the lasso across all SNRs.

5.3 Randomization as Implicit Shrinkage

Before concluding out work, we provide some additional intuition into the implicit regularization that appears to be taking place with the randomized ensembles (random forests and RandFS) studied in previous sections. This is more apparent and easily described in the more classical linear model forward selection setting with RandFS, though the same kind of effect is likely present with random forests, which might be seen as simply a more complex form of randomized forward selection.

As above, suppose we have data of the form 𝒟n={Z1,,Zn} where each Zi=(𝑿i,Yi), 𝑿i=(X1,i,,Xp,i) denotes a vector of p features, Y denotes the response, and the variables have a general relationship of the form Y=f(X)+ϵ. Now suppose that we obtain a regression estimate f^RFS=Xβ^RFS by averaging over B models, each built via randomized forward selection on 𝒟n to a depth of d. Note that this is identical to the RandFS procedure described above except that for technical reasons, models are built on the original data each time rather than on bootstrap samples. Each of the B individual models produces an estimate of the form

β^RFS(b)=β^0(b)+X(1)(b)β^(1)(b)++X(d)(b)β^(d)(b)

where X(j)(b) is the feature selected at the jth step in the bth model and β^(j)(b) is the corresponding coefficient estimate. Now consider a particular feature, say X1, and suppose that the ordinary least squares (OLS) estimator exists and that the OLS coefficient estimate for X1 is given by β^1,OLS. More generally, given an active set 𝒜{1,,p} containing a subset of feature indices, let β^1,OLS|𝒜 denote the OLS estimate of the coefficient for X1 when calculated over only the features with indices in 𝒜.

Given an orthogonal design matrix, for any indexing set 𝒜, β^1,OLS|𝒜 is equal to β^1,OLS whenever 1𝒜 and equal to 0 otherwise. Thus, if 𝒜b denotes the indices of those features selected for inclusion in the bth model, then β^1(b)=β^1,OLS if X1 is selected (i.e. 1𝒜b) and β^1(b)=0 otherwise. The final coefficient estimate produced by RandFS is thus of the form

β^1,RFS=1Bi=1Bβ1(b)=α1β^1,OLS+(1-α1)0=α1β^1,OLS

where 0α11 denotes the proportion of models in which X1 appears. In practice, for each feature Xk, the selection proportion αk will depend on the particular data observed, the value of mtry, the depth d of each model, and the importance of Xk relative to the other features. In particular though, so long as d is relatively large, αk should still be expected to be somewhat large even for moderately small values of mtry whenever Xk is a relatively important feature because it will have a very good chance of being included in the model if it is made eligible at any step. Thus, in this sense, not only does RandFS have a shrinkage effect on each variable, but variables that appear more important by virtue of being included in many models will be shrunken by less than those included only occasionally. In a very recent study released almost simultaneously to this work, LeJeune et al. (2019) demonstrate the same kind of shrinkage effect for randomly subsampled OLS ensemble estimators.

6 Discussion

The results in the previous sections provide substantial evidence that the mtry parameter – the distinguishing feature of random forests – has an implicit regularization effect on the procedure. Much like the tuning parameter λ in explicit regularization procedures like lasso and ridge regression, mtry serves to mitigate overfitting to particular features. Speculation along these lines is discussed informally in Hastie et al. (2009) who point out that random forests seem to behave similarly to ridge regression. Thus, contrary to conventional wisdom, random forests are not simply “better” than bagging, but rather, the relative success of random forests depends heavily on the amount of noise present in the problem at hand. At low to moderate SNRs, random forests are seen to perform substantially better than bagging whereas bagging becomes far more competitive in high SNR regimes. This suggests further that at least in regression contexts, the success of random forests is not in fact due to any kind of potential interpolation as was recently hypothesized in Wyner et al. (2017).

The obvious question is then, “Why do random forests appear to work so well in practice on real datasets?” As discussed in the introduction, there is certainly considerable evidence that random forests do often perform very well in a variety of settings across numerous scientific domains. In our view, the clear answer is that in practice, many datasets simply are quite noisy. Hastie et al. (2017) provide a thorough discussion along these lines, arguing that although commonly employed in simulations, on real data, SNRs as large as 5 or 6 are extremely rare. If this is indeed the case, it would explain why random forests are so often viewed as inherently superior to bagging.

Finally, we note that our findings are very much in line both with previous findings and with common practice. In practical applications, random forests are generally used “off-the-shelf” without any tuning of the mtry parameter. The plots corresponding to the low and medium settings in Figure 8 may shed some light on this: though not always optimal, the procedure with a fixed mtry value of 0.33 generally performs quite well. On the other hand, the results provided above do strongly support the notion discussed in many previous studies (Díaz-Uriarte and De Andres, 2006; Genuer et al., 2008; Bernard et al., 2009; Genuer et al., 2010; Probst et al., 2019) that the mtry parameter can significantly influence performance. Just as with explicit regularization procedures, the results above suggest that the mtry parameter in random forests can be thought of as controlling the amount of shrinkage and regularization and thus is best tuned in practice.

Acknowledgements

This research was supported in part by the University of Pittsburgh Center for Research Computing through the resources provided.

References

  • Ahmed et al. (2015) Mehreen Ahmed, Maham Jahangir, Hammad Afzal, Awais Majeed, and Imran Siddiqi. Using crowd-source based features from social media and conventional features to predict the movies popularity. In 2015 IEEE International Conference on Smart City/SocialCom/SustainCom (SmartCity), pages 273–278. IEEE, 2015.
  • Amit and Geman (1997) Yali Amit and Donald Geman. Shape quantization and recognition with randomized trees. Neural computation, 9(7):1545–1588, 1997.
  • Belkin et al. (2019) Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal. Reconciling modern machine-learning practice and the classical bias–variance trade-off. Proceedings of the National Academy of Sciences, 116(32):15849–15854, 2019.
  • Bergström et al. (2003) Christel AS Bergström, Ulf Norinder, Kristina Luthman, and Per Artursson. Molecular descriptors influencing melting point and their role in classification of solid drugs. Journal of chemical information and computer sciences, 43(4):1177–1185, 2003.
  • Bernard et al. (2007) Simon Bernard, Sébastien Adam, and Laurent Heutte. Using random forests for handwritten digit recognition. In Ninth International Conference on Document Analysis and Recognition (ICDAR 2007), volume 2, pages 1043–1047. IEEE, 2007.
  • Bernard et al. (2009) Simon Bernard, Laurent Heutte, and Sébastien Adam. Influence of hyperparameters on random forest accuracy. In International Workshop on Multiple Classifier Systems, pages 171–180. Springer, 2009.
  • Biau (2012) Gérard Biau. Analysis of a Random Forests Model. The Journal of Machine Learning Research, 98888:1063–1095, 2012.
  • Biau and Devroye (2010) Gérard Biau and Luc Devroye. On the layered nearest neighbour estimate, the bagged nearest neighbour estimate and the random forest method in regression and classification. Journal of Multivariate Analysis, 101(10):2499–2518, 2010.
  • Biau and Scornet (2016) Gérard Biau and Erwan Scornet. A random forest guided tour. Test, 25(2):197–227, 2016.
  • Biau et al. (2008) Gérard Biau, Luc Devroye, and Gábor Lugosi. Consistency of Random Forests and Other Averaging Classifiers. The Journal of Machine Learning Research, 9:2015–2033, 2008.
  • Biau et al. (2016) Gérard Biau, Erwan Scornet, and Johannes Welbl. Neural random forests. Sankhya A, pages 1–40, 2016.
  • Breiman (1996) Leo Breiman. Bagging predictors. Machine Learning, 24:123–140, 1996.
  • Breiman (2000) Leo Breiman. Randomizing outputs to increase prediction accuracy. Machine Learning, 40(3):229–242, 2000.
  • Breiman (2001) Leo Breiman. Random Forests. Machine Learning, 45:5–32, 2001.
  • Breiman et al. (1984) Leo Breiman, Jerome Friedman, Charles J. Stone, and R.A. Olshen. Classification and Regression Trees. Wadsworth, Belmont, CA, 1st edition, 1984.
  • Cammarata (1972) Arthur Cammarata. Interrelation of the regression models used for structure-activity analyses. Journal of medicinal chemistry, 15(6):573–577, 1972.
  • Chen et al. (2001) Scott Shaobing Chen, David L Donoho, and Michael A Saunders. Atomic decomposition by basis pursuit. SIAM review, 43(1):129–159, 2001.
  • Coleman et al. (2017) Tim Coleman, Lucas Mentch, Daniel Fink, Frank La Sorte, Giles Hooker, Wesley Hochachka, and David Winkler. Statistical inference on tree swallow migrations with random forests. arXiv preprint arXiv:1710.09793, 2017.
  • Coleman et al. (2019) Tim Coleman, Wei Peng, and Lucas Mentch. Scalable and efficient hypothesis testing with random forests. arXiv preprint arXiv:1904.07830, 2019.
  • Cui et al. (2017) Yifan Cui, Ruoqing Zhu, Mai Zhou, and Michael Kosorok. Some asymptotic results of survival tree and forest models. arXiv preprint arXiv:1707.09631, 2017.
  • Cutler et al. (2007) D Richard Cutler, Thomas C Edwards Jr, Karen H Beard, Adele Cutler, Kyle T Hess, Jacob Gibson, and Joshua J Lawler. Random forests for classification in ecology. Ecology, 88(11):2783–2792, 2007.
  • Díaz-Uriarte and De Andres (2006) Ramón Díaz-Uriarte and Sara Alvarez De Andres. Gene selection and classification of microarray data using random forest. BMC bioinformatics, 7(1):3, 2006.
  • Dietterich (2000) Thomas G Dietterich. An experimental comparison of three methods for constructing ensembles of decision trees: Bagging, boosting, and randomization. Machine learning, 40(2):139–157, 2000.
  • Dua and Graff (2017) Dheeru Dua and Casey Graff. UCI machine learning repository, 2017. URL http://archive.ics.uci.edu/ml.
  • Duroux and Scornet (2016) Roxane Duroux and Erwan Scornet. Impact of subsampling and pruning on random forests. arXiv preprint arXiv:1603.04261, 2016.
  • Efron (1986) Bradley Efron. How biased is the apparent error rate of a prediction rule? Journal of the American statistical Association, 81(394):461–470, 1986.
  • Efron and Tibshirani (1990) Bradley Efron and Robert Tibshirani. Generalized Additive Models. Chapman and Hall, London, 1990.
  • Ein-Dor and Feldmesser (1987) Phillip Ein-Dor and Jacob Feldmesser. Attributes of the performance of central processing units: A relative performance prediction model. Communications of the ACM, 30(4):308–318, 1987.
  • Fanaee-T and Gama (2014) Hadi Fanaee-T and Joao Gama. Event labeling combining ensemble detectors and background knowledge. Progress in Artificial Intelligence, 2(2-3):113–127, 2014.
  • Fanelli et al. (2013) Gabriele Fanelli, Matthias Dantone, Juergen Gall, Andrea Fossati, and Luc Van Gool. Random forests for real time 3d face analysis. International Journal of Computer Vision, 101(3):437–458, 2013.
  • Fernández-Delgado et al. (2014) Manuel Fernández-Delgado, Eva Cernadas, Senén Barro, and Dinani Amorim. Do we need hundreds of classifiers to solve real world classification problems? The Journal of Machine Learning Research, 15(1):3133–3181, 2014.
  • Freund et al. (1996) Yoav Freund, Robert E Schapire, et al. Experiments with a new boosting algorithm. In icml, volume 96, pages 148–156. Citeseer, 1996.
  • Friedman (1991) Jerome H Friedman. Multivariate Adaptive Regression Splines. The annals of statistics, pages 1–67, 1991.
  • Genuer et al. (2008) Robin Genuer, Jean-Michel Poggi, and Christine Tuleau. Random forests: some methodological insights. arXiv preprint arXiv:0811.3619, 2008.
  • Genuer et al. (2010) Robin Genuer, Jean-Michel Poggi, and Christine Tuleau-Malot. Variable selection using random forests. Pattern Recognition Letters, 31(14):2225–2236, 2010.
  • Guha and Jurs (2004) Rajarshi Guha and Peter C Jurs. Development of linear, ensemble, and nonlinear models for the prediction and interpretation of the biological activity of a set of pdgfr inhibitors. Journal of Chemical Information and Computer Sciences, 44(6):2179–2189, 2004.
  • Guo et al. (2011) Li Guo, Nesrine Chehata, Clément Mallet, and Samia Boukir. Relevance of airborne lidar and multispectral image data for urban scene classification using random forests. ISPRS Journal of Photogrammetry and Remote Sensing, 66(1):56–66, 2011.
  • Harrison Jr and Rubinfeld (1978) David Harrison Jr and Daniel L Rubinfeld. Hedonic housing prices and the demand for clean air. Journal of environmental economics and management, 5(1):81–102, 1978.
  • Hastie et al. (2009) Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, New York, 2nd edition, 2009.
  • Hastie et al. (2017) Trevor Hastie, Robert Tibshirani, and Ryan J Tibshirani. Extended comparisons of best subset selection, forward stepwise selection, and the lasso. arXiv preprint arXiv:1707.08692, 2017.
  • He and Jurs (2005) Linnan He and Peter C Jurs. Assessing the reliability of a qsar model’s predictions. Journal of Molecular Graphics and Modelling, 23(6):503–523, 2005.
  • Hooker and Mentch (2019) Giles Hooker and Lucas Mentch. Please stop permuting features: An explanation and alternatives. arXiv preprint arXiv:1905.03151, 2019.
  • Hothorn et al. (2005) Torsten Hothorn, Peter Bühlmann, Sandrine Dudoit, Annette Molinaro, and Mark J Van Der Laan. Survival ensembles. Biostatistics, 7(3):355–373, 2005.
  • Huang et al. (2010) Chen Huang, Xiaoqing Ding, and Chi Fang. Head pose estimation based on random forests for multiclass classification. In 2010 20th International Conference on Pattern Recognition, pages 934–937. IEEE, 2010.
  • Ishwaran et al. (2008) Hemant Ishwaran, Udaya B Kogalur, Eugene H Blackstone, Michael S Lauer, et al. Random survival forests. The annals of applied statistics, 2(3):841–860, 2008.
  • Klusowski (2019a) Jason M. Klusowski. Analyzing cart. arXiv preprint 1906.10086v5, 2019a.
  • Klusowski (2019b) Jason M. Klusowski. Sharp analysis of a simple model for random forests. arXiv preprint 1805.02587v6, 2019b.
  • LeJeune et al. (2019) Daniel LeJeune, Hamid Javadi, and Richard G Baraniuk. The implicit regularization of ordinary least squares ensembles. arXiv preprint arXiv:1910.04743, 2019.
  • Li et al. (2000) Jinyan Li, Guozhu Dong, and Kotagiri Ramamohanarao. Instance-based classification by emerging patterns. In European Conference on Principles of Data Mining and Knowledge Discovery, pages 191–200. Springer, 2000.
  • Liaw et al. (2002) Andy Liaw, Matthew Wiener, et al. Classification and regression by randomforest. R news, 2(3):18–22, 2002.
  • Lin and Jeon (2006) Yi Lin and Yongho Jeon. Random forests and adaptive nearest neighbors. Journal of the American Statistical Association, 101(474):578–590, 2006.
  • Lopes et al. (2019a) Miles E Lopes, Suofei Wu, and Thomas Lee. Measuring the algorithmic convergence of randomized ensembles: The regression setting. arXiv preprint arXiv:1908.01251, 2019a.
  • Lopes et al. (2019b) Miles E Lopes et al. Estimating the algorithmic variance of randomized ensembles via the bootstrap. The Annals of Statistics, 47(2):1088–1112, 2019b.
  • Mehrmohamadi et al. (2016) Mahya Mehrmohamadi, Lucas K Mentch, Andrew G Clark, and Jason W Locasale. Integrative modelling of tumour dna methylation quantifies the contribution of metabolism. Nature communications, 7:13666, 2016.
  • Meinshausen (2006) Nicolai Meinshausen. Quantile regression forests. Journal of Machine Learning Research, 7(Jun):983–999, 2006.
  • Meinshausen (2007) Nicolai Meinshausen. Relaxed lasso. Computational Statistics & Data Analysis, 52(1):374–393, 2007.
  • Mentch and Hooker (2016) Lucas Mentch and Giles Hooker. Quantifying uncertainty in random forests via confidence intervals and hypothesis tests. The Journal of Machine Learning Research, 17(1):841–881, 2016.
  • Mentch and Hooker (2017) Lucas Mentch and Giles Hooker. Formal hypothesis tests for additive structure in random forests. Journal of Computational and Graphical Statistics, 26(3):589–597, 2017.
  • Moro et al. (2016) Sérgio Moro, Paulo Rita, and Bernardo Vala. Predicting social media performance metrics and evaluation of the impact on brand building: A data mining approach. Journal of Business Research, 69(9):3341–3351, 2016.
  • Nicodemus et al. (2010) Kristin K Nicodemus, James D Malley, Carolin Strobl, and Andreas Ziegler. The behaviour of random forest permutation-based variable importance measures under predictor correlation. BMC bioinformatics, 11(1):110, 2010.
  • Olson and Wyner (2018) Matthew A Olson and Abraham J Wyner. Making sense of random forest probabilities: a kernel perspective. arXiv preprint arXiv:1812.05792, 2018.
  • Peng et al. (2019) Wei Peng, Tim Coleman, and Lucas Mentch. Asymptotic distributions and rates of convergence for random forests and other resampled ensemble learners. arXiv preprint arXiv:1905.10651, 2019.
  • Prasad et al. (2006) Anantha M Prasad, Louis R Iverson, and Andy Liaw. Newer classification and regression tree techniques: bagging and random forests for ecological prediction. Ecosystems, 9(2):181–199, 2006.
  • Probst and Boulesteix (2017) Philipp Probst and Anne-Laure Boulesteix. To tune or not to tune the number of trees in random forest. Journal of Machine Learning Research, 18:181–1, 2017.
  • Probst et al. (2019) Philipp Probst, Marvin N Wright, and Anne-Laure Boulesteix. Hyperparameters and tuning strategies for random forest. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 9(3):e1301, 2019.
  • Quinlan (1993) J Ross Quinlan. Combining instance-based and model-based learning. In Proceedings of the tenth international conference on machine learning, pages 236–243, 1993.
  • Scornet (2016) Erwan Scornet. Random forests and kernel methods. IEEE Transactions on Information Theory, 62(3):1485–1500, 2016.
  • Scornet (2017) Erwan Scornet. Tuning parameters in random forests. ESAIM: Proceedings and Surveys, 60:144–162, 2017.
  • Scornet et al. (2015) Erwan Scornet, Gérard Biau, Jean-Philippe Vert, et al. Consistency of random forests. The Annals of Statistics, 43(4):1716–1741, 2015.
  • Sexton and Laake (2009) Joseph Sexton and Petter Laake. Standard errors for bagged and random forest estimators. Computational Statistics & Data Analysis, 53(3):801–811, 2009.
  • Steingrimsson et al. (2019) Jon Arni Steingrimsson, Liqun Diao, and Robert L Strawderman. Censoring unbiased regression trees and ensembles. Journal of the American Statistical Association, 114(525):370–383, 2019.
  • Strobl et al. (2007) Carolin Strobl, Anne-Laure Boulesteix, Achim Zeileis, and Torsten Hothorn. Bias in random forest variable importance measures: Illustrations, sources and a solution. BMC bioinformatics, 8(1):25, 2007.
  • Strobl et al. (2008) Carolin Strobl, Anne-Laure Boulesteix, Thomas Kneib, Thomas Augustin, and Achim Zeileis. Conditional variable importance for random forests. BMC bioinformatics, 9(1):307, 2008.
  • Tibshirani (1996) Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1):267–288, 1996.
  • Tibshirani (2015) Ryan J Tibshirani. Degrees of freedom and model search. Statistica Sinica, pages 1265–1296, 2015.
  • Todeschini et al. (1995) R Todeschini, P Gramatica, R Provenzani, and E Marengo. Weighted holistic invariant molecular descriptors. part 2. theory development and applications on modeling physicochemical properties of polyaromatic hydrocarbons. Chemometrics and Intelligent Laboratory Systems, 27(2):221–229, 1995.
  • Toloşi and Lengauer (2011) Laura Toloşi and Thomas Lengauer. Classification with correlated features: unreliability of feature ranking and solutions. Bioinformatics, 27(14):1986–1994, 2011.
  • Tsanas et al. (2009) Athanasios Tsanas, Max A Little, Patrick E McSharry, and Lorraine O Ramig. Accurate telemonitoring of parkinson’s disease progression by noninvasive speech tests. IEEE transactions on Biomedical Engineering, 57(4):884–893, 2009.
  • Vanschoren et al. (2013) Joaquin Vanschoren, Jan N. van Rijn, Bernd Bischl, and Luis Torgo. Openml: Networked science in machine learning. SIGKDD Explorations, 15(2):49–60, 2013. doi: 10.1145/2641190.2641198. URL http://doi.acm.org/10.1145/2641190.2641198.
  • Wager and Athey (2018) Stefan Wager and Susan Athey. Estimation and inference of heterogeneous treatment effects using random forests. Journal of the American Statistical Association, 113(523):1228–1242, 2018.
  • Wager et al. (2014) Stefan Wager, Trevor Hastie, and Bradley Efron. Confidence intervals for random forests: The jackknife and the infinitesimal jackknife. The Journal of Machine Learning Research, 15(1):1625–1651, 2014.
  • Waugh (1995) Samuel George Waugh. Extending and benchmarking Cascade-Correlation: extensions to the Cascade-Correlation architecture and benchmarking of feed-forward supervised artificial neural networks. PhD thesis, University of Tasmania, 1995.
  • Welbl (2014) Johannes Welbl. Casting random forests as artificial neural networks (and profiting from it). In German Conference on Pattern Recognition, pages 765–771. Springer, 2014.
  • Wyner et al. (2017) Abraham J Wyner, Matthew Olson, Justin Bleich, and David Mease. Explaining the success of adaboost and random forests as interpolating classifiers. The Journal of Machine Learning Research, 18(1):1558–1590, 2017.
  • Yeh (1998) I-C Yeh. Modeling of strength of high-performance concrete using artificial neural networks. Cement and Concrete research, 28(12):1797–1808, 1998.
  • Zaman and Hirose (2009) Faisal Zaman and Hideo Hirose. Effect of subsampling rate on subbagging and related ensembles of stable classifiers. In International Conference on Pattern Recognition and Machine Intelligence, pages 44–49. Springer, 2009.
  • Zhu et al. (2015) Ruoqing Zhu, Donglin Zeng, and Michael R Kosorok. Reinforcement learning trees. Journal of the American Statistical Association, 110(512):1770–1784, 2015.

Appendix A Additional Simulations and Figures

In Section 3, Figure 3 shows the estimated degrees of freedom for random forests across various values of mtry in four different regression setups at a fixed SNR of 3.52. In each case we see that the dof increases with both maxnodes and mtry and we note that the same relationships were seen to also emerge in alternative setups. Figure 9 below shows the same experiments carried out with the SNR set to the much smaller level of 0.09 and indeed the findings remain the same.

Figure 9: Degrees of freedom for random forests at different levels of mtry with a SNR of 0.09.

In Section 4.2 we considered the problem of estimating the optimal value of mtry for random forests for both the linear and MARS models at various SNR levels. In the main text, Figure 5 shows plots of the optimal mtry measured as that which obtains the lowest average test error over 500 iterations. In contrast, Figure 10 here calculates the optimal mtry value on each of the 500 iterations and then takes the empirical mean. The results in Figure 10 show the same general pattern as seen in Figure 5. As the SNR increases, so does the optimal value of mtry.

Figure 10: Optimal value of mtry vs SNR for the MARS and linear model.

Figure 11 shows exactly the same plots as in Figure 8 but here we add error bars at each point corresponding to ±1 standard deviation across the 100 replications. These are reserved for the appendix only for readability as the error bars can make the plots a bit more muddled and difficult to make out.

Figure 11: Performance of FS, BaggFS, RandFS, lasso and relaxed lasso across SNR levels for linear models in the low, medium, high-5, and high-10 settings. Error bars at each point correspond to ±1 standard deviation across the 100 replications.