Abstract
We present Neural Random Forest Imitation  a novel approach for transformingrandom forests into neural networks. Existing methods produce very inefficientarchitectures and do not scale. In this paper, we introduce a new method forgenerating data from a random forest and learning a neural network thatimitates it. Without any additional training data, this transformation createsvery efficient neural networks that learn the decision boundaries of a randomforest. The generated model is fully differentiable and can be combined withthe feature extraction in a single pipeline enabling further endtoendprocessing. Experiments on several realworld benchmark datasets demonstrateoutstanding performance in terms of scalability, accuracy, and learning withvery few training examples. Compared to stateoftheart mappings, wesignificantly reduce the network size while achieving the same or even improvedaccuracy due to better generalization.
Quick Read (beta)
Neural Random Forest Imitation
Abstract
We present Neural Random Forest Imitation – a novel approach for transforming random forests into neural networks. Existing methods produce very inefficient architectures and do not scale. In this paper, we introduce a new method for generating data from a random forest and learning a neural network that imitates it. Without any additional training data, this transformation creates very efficient neural networks that learn the decision boundaries of a random forest. The generated model is fully differentiable and can be combined with the feature extraction in a single pipeline enabling further endtoend processing. Experiments on several realworld benchmark datasets demonstrate outstanding performance in terms of scalability, accuracy, and learning with very few training examples. Compared to stateoftheart mappings, we significantly reduce the network size while achieving the same or even improved accuracy due to better generalization.
1 Introduction
In recent years, convolutional neural networks have become very popular in various tasks such as image classification [Krizhevsky2012AlexNet, Simonyan15], object detection [liu2016ssd, redmon2018yolov3, ren2015faster], or semantic segmentation [Zhao2017PSPNet, RFB15a]. Researchers have published many datasets such as CIFAR10 [Krizhevsky2012CIFAR] and ImageNet [imagenet_cvpr09] for training neural networks and put enormous effort to provide labels for each individual image. CIFAR10 has $\mathrm{60\hspace{0.17em}000}$ labeled examples and ImageNet over $14$ million labeled examples. These datasets are a great basis for training and comparing algorithms on a specific benchmark dataset.
For realworld applications, the dependency on large amounts of labeled data represents a major limitation. Frequently, there is little or even no labeled data for a particular task and hundreds or thousands of examples have to be collected and annotated. This particularly affects new applications and rare categories (e.g rare animals). Transfer learning and regularization methods are usually applied to reduce overfitting. However, for learning with little data, the networks still have a huge number of parameters that have to be finetuned – even if just the last layers are trained.
The combination of convolutional neural networks brings both worlds together. CNNs have demonstrated great performance in learning expressive feature representations, but require large amount of training data. RFs are very good in learning with very little data without overfitting. The advantages of mapping random forests into neural networks are threefold: (1) All components can be combined in a single pipeline and existing highperformance deep learning frameworks can be used directly. This accelerates the processing and enables readily GPU processing. (2) The mapping initializes neural networks with very few training examples enabling a warmstart. (3) The resulting network can be further finetuned endtoend jointly optimizing the features as well as the classifier.
Stateoftheart methods for mapping random forest into neural networks [Sethi1990RFNN, welbl2014casting, massiceti2017random], create a twohiddenlayer neural network by adding a neuron for each split node and each leaf node of the decision trees. The size of the networks become very large as the number of nodes grows exponentially with increasing depth of the decision trees. Additionally, many weights are set to zero, so that an inefficient representation is created. Due to both reasons, the mappings do not scale and are only applicable to simple random forests.
In this work, we present a transformation of random forests into neural networks which creates a very efficient neural network. We introduce a method for generating data from a random forest which creates any amount of input data and corresponding labels. With this data, a neural network is trained that learns to imitate the random forest. Experiments demonstrate that the accuracy of the imitating neural network is equal to the original accuracy or even outperforms the original random forest due to better generalization while being significant smaller (see Fig. 1).
To summarize, our contributions are:

•
We propose a novel method for transforming random forests into neural networks by generating data from a random forest and training an imitating neural network. Labeled data samples are created by evaluating the decision boundaries and guided routing to selected leaf nodes.

•
Our transformation is scalable to complex classifiers and deep random forests.

•
We enable learning and initializing neural networks with very little data.

•
Convolutional neural networks and random forests can be combined in an fully differentiable, endtoend pipeline for acceleration and further finetuning.

•
We will publish our source code upon acceptance and provide an easytouse toolbox for imitating random forest with neural networks.
2 Related Work
Convolutional Neural Networks
Convolutional neural networks have a long history [SCHMIDHUBER201585] and started with handwriting recognition as one of the first applications [LeCun:1989:BAH:1351079.1351090]. Publicly available image datasets such as ImageNet [imagenet_cvpr09] contributed greatly to the success. Krizhevsky et al. [Krizhevsky2012AlexNet] played an important role proposing a neural network called AlexNet which reduces the top5 test error on ImageNet from $26.2\%$ to $15.3\%$. The architecture consists of multiple alternating convolutional layers and activation functions (optionally followed by dropout or pooling layers) to generate features followed by fullyconnected layers. In the following years a series of improvements have been proposed. Simonyan and Zisserman [Simonyan15] further decreased the top5 test error to $7.3\%$ by introducing a deeper but simplified architecture based on $3\times 3$ filter kernels. More complex structure such as GoogLeNet [Szegedy15] use Inception modules for multiscale processing. ResNet [He2016DeepRL] introduce residual connections for training much deeper neural networks without the problem of vanishing gradients. Huang et al. [Huang_2017_CVPR] further increase the connectivity with densely connected layers.
Transfer Learning
Learning in one domain and transferring the learned knowledge to another domain has become a standard technique. Convolutional neural networks have demonstrated great performance in learning feature representations [Agrawal2014, pmlrv32donahue14]. The transition from general to specific convolution layers and their transferability has been studied by Yosinski et al. [Yosinski:2014:TFD:2969033.2969197]. Huh et al. [DBLP:journals/corr/HuhAE16] perform an indepth analysis of ImageNet features. Li et al. [Li2016] propose a method for finetuning without forgetting original learned capabilities.
Combining CNNs and RFs
Random forests [breiman2001random] are an ensemble learning method for classification and regression tasks. Sethi [Sethi1990RFNN] and Welbl [welbl2014casting, Biau2018] presented a technique for mapping random forests to neural networks. The direct mapping creates a twohiddenlayer neural network by adding a neuron in the first hidden layer for every split node and a neuron in the second hidden layer for every leaf node. The weights and biases are set according to the parameters of the decision trees. Massiceti et al. [massiceti2017random] extended this approach and introduced a network splitting strategy by dividing each decision tree into multiple subtrees of depth $\stackrel{~}{r}$.
These techniques, however, are only applicable for trees of limited depth. As the number of nodes grows exponentially with increasing depth of the trees, a very inefficient representation is created causing a very high memory consumption. In this paper, we address this issue by proposing a much more efficient transformation.
3 Background and Notation
Random forests [breiman2001random] are an ensemble method consisting of multiple decision trees. In this section, we briefly describe decision trees, random forests and the notation used throughout this paper.
Decision Trees
Decision trees consist of split nodes ${N}^{\text{Split}}$ and leaf nodes ${N}^{\text{Leaf}}$. Each split node $s\in {N}^{\text{Split}}$ performs a split decision and routes a data sample $x$ to the left or right child node, denoted as ${\mathrm{c}}_{\text{left}}$ and ${\mathrm{c}}_{\text{right}}(s)$, respectively. When using axisaligned split decisions, a single feature $f(s)$ and a threshold $\theta (s)$ are the bases for the split. If the value of feature $f(s)$ is smaller than $\theta (s)$, the data sample is routed to the left child node and otherwise to the right child node, denoted as
$x\in {\mathrm{c}}_{\text{left}}(s)$  $$  (1)  
$x\in {\mathrm{c}}_{\text{right}}(s)$  $\iff {x}_{\mathrm{f}(s)}\ge \theta (s).$  (2) 
Data samples are routed through a decision tree until a leaf node $l\in {N}^{\text{Leaf}}$ is reached which store the target value. For the classification task, these are the class probabilities ${P}_{\text{leaf}}(l)=({p}_{1}^{l},\mathrm{\dots},{p}_{C}^{l})$, where $C$ is the number of classes.
Random Forests
A single decision tree is very fast and operates on highdimensional data, however, tends to overfit the training data by constructing a deep tree which perfectly separates all training examples. While having a very small training error, this easily results in a large test error.
Random forests address this problem by learning an ensemble of ${n}_{T}$ decision trees. Each tree is trained with a random subset of training examples and features. The prediction $\mathrm{RF}(x)$ for a random forest is calculated by averaging the predictions of all decision trees.
4 Neural Random Forest Imitation
In this section, a novel approach called Neural Random Forest Imitation (NRFI) for transforming random forests into neural networks is presented. The main concepts include (1) generating training data from decision trees and random forests, (2) adding strategies for reducing conflicts and increasing the variety of the generated examples, and (3) training a neural network that imitates the random forest by learning the decision boundaries. As a result, NRFI enables the transformation of random forest into very efficient neural networks.
4.1 Data Generation
First, we propose a method for generating data from a random forest. A data sample $x\in {\mathbb{R}}^{N}$ is a $N$dimensional vector where $N$ is the number of features. We select a target class $t\in [1,\mathrm{\dots},C]$ and adjust the data sample so that it is classified as such. For that, the decision boundaries from the root nodes to the leaf nodes are evaluated and the data is modified so that it is routed to selected leaf nodes.
4.1.1 Data Initialization
A data sample $x$ is initialized randomly. Let ${f}_{\text{mean}},{f}_{\text{std}}\in {\mathbb{R}}^{N}$ be the mean and standard deviation per feature. We sample $x\sim \mathcal{N}({f}_{\text{mean}},{c}_{\text{std}}\cdot {f}_{\text{std}})$, where the standard deviation is multiplied by a factor ${c}_{\text{std}}\ge 1$ to increase the distribution of the data samples. The data is clipped to the feature minimum ${f}_{\text{min}}$ and maximum ${f}_{\text{max}}$.
In the next step, elements of $x$ are randomly set to zero, each with a probability of ${p}_{\text{zero}}$. This strategy is especially designed for features that are extract from activation functions like ReLU where a larger number of elements is zero.
4.1.2 Data Generation from Decision Tree
A decision tree processes an input vector $x$ by routing the data through the tree until a leaf is reached. At each node, a split decision is evaluated and the input is passed to the left child node or to the right child node. Finally, a leaf $l$ is reached which stores the probabilities ${p}^{l}=({p}_{1}^{l},\mathrm{\dots},{p}_{C}^{l})$ for each class.
We reverse this process and start with a method for generating training data from a decision tree. The pseudo code is presented in Algorithm 4.1.2. First, the class weights $W(n)=({w}_{1}^{n},\mathrm{\dots},{w}_{C}^{n})$ for every node $n$ are defined:
$$W(n)=\{\begin{array}{cc}{P}_{\text{leaf}}(n)\hfill & n\in {N}^{\text{Leaf}}\hfill \\ W({\mathrm{c}}_{\text{left}}(n))+W({\mathrm{c}}_{\text{right}}(n))\hfill & n\in {N}^{\text{Split}}\hfill \end{array}$$  (3) 
For every leaf node, the class weights are equal to the stored probabilities in the leaf. For every split node, the class weights in the child nodes are summed up.
The aim of the algorithm is to generate a data sample for target class $t$. For that, characteristics of the target class are successively added to the data sample. Starting at the root node, we modify the input data is so that it is routed though selected split nodes until a leaf node is reached. For learning the decision boundaries, a new value is sampled around the threshold $\theta (n)$ of the current node $n$ with standard deviation ${f}_{\text{std},\mathrm{f}(n)}$ when a feature is used for the first time:
$${x}_{\mathrm{f}(n)}\sim \mathcal{N}(\theta (n),{f}_{\text{std},\mathrm{f}(n)}).$$  (4) 
The routing is guided based on the weights for the target class in the left child node ${w}_{\text{left}}={w}_{t}^{{\mathrm{c}}_{\text{left}}(n)}$ and right child node ${w}_{\text{right}}={w}_{t}^{{\mathrm{c}}_{\text{right}}(n)}$. The weights are normalized using L2norm denoted as ${\widehat{p}}_{\text{left}}$ and ${\widehat{p}}_{\text{right}}$. Afterwards, the left or right child node is randomly selected with ${\widehat{p}}_{\text{left}}$ and ${\widehat{p}}_{\text{right}}$, respectively, as the next node ${n}_{\text{next}}$.
In the next step, we verify that the data sample is routed to the selected child node by evaluating the split decision. A split node $s$ routes the data to the left or right child node based on a split feature $\mathrm{f}(s)$ and a threshold $\theta (s)$. If the value of the split feature ${x}_{\mathrm{f}(s)}$ is smaller than $\theta (s)$, the data sample is routed to the left child node and otherwise to the right child node. The data sample is modified if it is not already routed to the selected child node by assigning a new value. If the selected child node is the left child node, the value has to be smaller than the threshold $\theta (s)$ and a new value within the minimum feature value ${f}_{\text{min},\mathrm{f}(n)}$ and $\theta (s)$ is randomly sampled:
${x}_{\mathrm{f}(n)}\sim U({f}_{\text{min},\mathrm{f}(n)},\theta (n)).$  (5) 
If the data sample is supposed to be routed to the right child node, the new value is randomly sampled between $\theta (s)$ and the maximum feature value ${f}_{\text{max},\mathrm{f}(n)}$:
${x}_{\mathrm{f}(n)}\sim U(\theta (n),{f}_{\text{max},\mathrm{f}(n)}).$  (6) 
This process is repeated until a leaf node is reached. In each node characteristics are added that classify the data sample as the target class.
During this process, modifications can conflict with previous decisions because features are used multiple times within a decision tree or across multiple decision trees. We present a strategy for minimizing conflicting routings denoted as path weighting (PW). The current routing – either ${w}_{\text{left}}$ or ${w}_{\text{right}}$ – is weighted with a factor ${w}_{\text{path}}\ge 1$ to prioritize the path and do not change the data sample if possible. This effect is particularly interesting when random forests are processed in the next section.
Overall, the presented method enables the generation of data samples and corresponding labels from a decision tree without adding any further data.
[t]
Input: Data sample $x$, target class $t$, as well as feature
minimum ${f}_{\text{min}}$, maximum ${f}_{\text{max}}$, and standard deviation ${f}_{\text{std}}$
Output: Data sample for target class $t$
{algorithmic}[1]
\State$n\leftarrow $ root node
\While$n\notin {N}^{\text{Leaf}}$
\Iffeature $\mathrm{f}(n)$ is used for the first time
\State${x}_{\mathrm{f}(n)}\sim \mathcal{N}(\theta (n),{f}_{\text{std},\mathrm{f}(n)})$
\EndIf\State${w}_{\text{left}}\leftarrow {w}_{t}^{{\mathrm{c}}_{\text{left}}(n)}$
\State${w}_{\text{right}}\leftarrow {w}_{t}^{{\mathrm{c}}_{\text{right}}(n)}$
\Iffeature $\mathrm{f}(n)$ is already used
\Stateweight current route with ${w}_{\text{path}}$
\EndIf\State${\widehat{p}}_{\text{left}},{\widehat{p}}_{\text{right}}\leftarrow $ normalize ${w}_{\text{left}}$ and ${w}_{\text{right}}$
\State${n}_{\text{next}}\leftarrow $ randomly select left or right child node with probability of ${\widehat{p}}_{\text{left}}$ and ${\widehat{p}}_{\text{right}}$, respectively
\If${n}_{\text{next}}={\mathrm{c}}_{\text{left}}(n)$
\If${x}_{\mathrm{f}(n)}\ge \theta (n)$
\State${x}_{\mathrm{f}(n)}\sim U({f}_{\text{min},\mathrm{f}(n)},\theta (n))$
\EndIf\Else
\If$$
\State${x}_{\mathrm{f}(n)}\sim U(\theta (n),{f}_{\text{max},\mathrm{f}(n)})$
\EndIf\EndIf\Statemark feature $\mathrm{f}(n)$ as used
\State$n\leftarrow {n}_{\text{next}}$
\EndWhile\State\Return$x$
4.1.3 Data Generation from Random Forest
In the next step, we extend the method to generate data from random forests. Random forests consists of ${n}_{T}$ decision trees $RF=\{{T}_{1},\mathrm{\dots},{T}_{{n}_{T}}\}$. For generating a data sample $x$, the presented method for a single decision tree is applied to multiple decision trees consecutively. In each decision tree, the data sample is modified and routed to selected nodes based on the target class $t$. When using all decision trees, data samples are created where all trees agree with a high probability.
We introduce a strategy called decision tree subset (DTS) for generating examples with varying confidence, i.e. the predictions of the individual decision trees diverge. To achieve this, a factor ${p}_{\text{forest}}\in [0,1]$ is defined determining the size of a subset of decision trees ${n}_{\text{sub}}=\lceil {n}_{T}\cdot {p}_{\text{forest}}\rceil $. A subset of ${n}_{\text{sub}}$ decision trees is randomly selected:
$$R{F}_{\text{sub}}\subseteq RF\text{with}R{F}_{\text{sub}}={n}_{\text{sub}}.$$  (7) 
All decision trees in $R{F}_{\text{sub}}$ are processed in random order to generate a data sample. For each decision tree the presented method modifies the data sample based on the target class. Finally, the output of the random forest $y=\mathrm{RF}(x)$ is predicted. In most cases, the prediction matches the target class. However, due to multiple factors such as the stochastic process, a small subset size, or varying predictions of the decision trees, it can also be different. As a result, an inputtarget pair $(x,y)$ has been created showing similar characteristics as the target class and any amount of data can be generated by repeating this process.
4.2 Imitation Learning
Finally, a neural network that imitates the random forest is trained. The network learns the decision boundaries from the generated data and approximates the same function as the random forest. The network architecture is based on a fullyconnected network with one or multiple hidden layers. The dimension of the input and output data is the same as of the random forest, i.e. a $N$dimensional input and $C$dimensional output. Each hidden layer is followed by a ReLU activation function [Nair:2010:ReLU]. The last fully connected layer is using a softmax activation function.
For training, we generate inputtarget pairs $(x,y)$ as described in the last section. Additionally, the data samples $x$ are normalized by subtracting the mean ${f}_{\text{mean}}$ and subsequently dividing by the absolute maximum ${f}_{\text{absmax}}$ per feature. These training examples are feed into the training process to teach the network predicting the same results as the random forest. To avoid overfitting, the data is generated onthefly so that each training example is unique. In this way, we learn an efficient representation of the decision boundaries and are able to transform random forest into neural networks.
5 Experiments
In this section, we present the experimental setup including the datasets, feature learning, training of the random forests, and neural random forest imitation. All experiments are performed on a NVIDIA GeForce GTX 1080 Ti using Tensorflow [tensorflow2015whitepaper] and Keras [chollet2015keras].
5.1 Datasets
We evaluate the proposed method on five standard benchmark datasets. These datasets cover a wide range of different image classification tasks. MNIST [lecunmnisthandwrittendigit2010] is a dataset of handwritten digits. CIFAR10 and CIFAR100 [Krizhevsky2012CIFAR] contain natural RGB images of size $32\times 32$ over 10 classes such as airplane, cat, or ship. GTSRB [Stallkamp2012] is a dataset for traffic sign classification. Caltech101 [FeiFei:2007] consists of images in $101$ categories and an additional background category.
Following [Yosinski:2014:TFD:2969033.2969197], each dataset is divided into two nonoverlapping subsets, denoted as ${D}_{S}$ and ${D}_{T}$. Both subsets contain half of the classes and approximately half of the data. Each subset is split into 60% for training, 20% for validation, and 20% for testing, if not already provided.
Finally, the number of training and validation examples of the target dataset is limited to ${n}_{\text{limit}}$ examples per class. In our experiments we set ${n}_{\text{limit}}$ to $20$. As a result, two datasets for transfer learning are created: A source dataset ${D}_{S}$ which has large amount of labeled data and is used for feature learning and a target dataset ${D}_{T}$ which has very few labeled training examples.
5.2 Feature Learning
For feature learning, we train a ResNet50 [He2016DeepRL] pretrained on ImageNet [imagenet_cvpr09]. The input images are resized to $32\times 32$ for MNIST, CIFAR10, and CIFAR100, $48\times 48$ for GTSRB, and $128\times 128$ for Caltech101. It should be noted that the images from MNIST are slightly upscaled because the minimum input size of the feature network is $32\times 32$.
Stochastic Gradient Descent is used for optimizing the network with learning rate of $0.001$, momentum of $0.9$, and weight decay of ${10}^{6}$. The batch size is set to $32$ and the input data is normalized to $[0.5,0.5]$. Augmentation is used by randomly applying rotation, shifting, shearing, zooming, and flipping (except for MNIST and GTSRB). The network is trained for $200$ epochs. After each epoch, we evaluate the performance on the validation set and select the model with minimum loss.
5.3 Feature Extraction
Based on the learned features, the network is used to extract a feature representation for every example from the target dataset ${D}_{T}$. To determine the best layer for extracting the features for each dataset, the performance on the validation dataset over all activation layers is evaluated. For each activation layer, we extract features, train a random forest with $100$ decision trees on the training set, and evaluate the performance on the validation set. This process is repeated ten times and the results are averaged.
Afterwards, the features for the training, validation, and test set are generated. For the training examples, five different random augmentations are applied to the input image (same methods as for feature learning). At the end, the minimum ${f}_{\text{min}}$, maximum ${f}_{\text{max}}$, absolute maximum ${f}_{\text{absmax}}$, mean ${f}_{\text{mean}}$, and standard deviation ${f}_{\text{std}}$ are calculated per feature on the extracted training features.
5.4 Random Forest
Random forests are capable of learning with very little labeled data. We train a random forest with ${n}_{T}$ decision trees on the extracted training features of the target dataset ${D}_{T}$ where only limited amount of data is available.
The number of decision trees is determined similar as before by training random forests with different number of decision trees ${n}_{T}\in \{5,10,\mathrm{\dots},100\}$ and evaluating the performance on the validation set. The experiment is repeated ten times. For each dataset, the best number of decision trees is selected based on the average validation accuracy.
5.5 Neural Random Forest Imitation
Method  MNIST [%]  CIFAR10 [%]  CIFAR100 [%]  GTSRB [%]  Caltech101 [%] 

RDG  $98.86\pm 0.19$  $69.69\pm 0.63$  $33.61\pm 0.95$  $89.66\pm 0.82$  $78.63\pm 0.91$ 
RDG + PW  $98.72\pm 0.25$  $69.90\pm 1.05$  $34.22\pm 0.67$  $89.38\pm 0.92$  $78.88\pm 0.92$ 
RDG + DTS  $99.14\pm 0.06$  $70.85\pm 0.48$  $41.00\pm 0.35$  $92.27\pm 0.31$  $84.89\pm 0.81$ 
RDG + PW + DTS  $99.13\pm 0.06$  $71.43\pm 0.58$  $41.24\pm 0.45$  $92.29\pm 0.33$  $84.88\pm 0.82$ 
Before analyzing the performance of neural random forest imitation, we present an ablation study for two strategies that have been presented: path weighting (PW) and decision tree subset (DTS). The factor ${p}_{\text{forest}}$ sets the size of the subset of all decision trees and weight ${w}_{\text{path}}$ prioritizes the current routing by weighting the corresponding decision. Please refer to Section 4.1 for details. For generating a wide variety of data, we sample the corresponding values for each data sample individually:
${w}_{\text{path}}$  $\sim 1+\mathcal{N}(0,5)$  (8)  
${p}_{\text{forest}}$  $\sim U(0,1).$  (9) 
The parameters have been determined empirically.
In the following experiment, we evaluate the effects of the path weighting and decision tree subset strategy. For each setting, $1000$ examples are generated and the predicted confidences of the ground truth class by the random forest are analyzed. The distribution of the confidences is visualized in Fig. 2 for CIFAR10. The raw data generation (RDG) without path weighting and decision tree subset generates a limited variety of data which are all predicted with a confidence of around $0.8$. It is also conspicuous that no data samples with very high confidence are created due to contradictions of multiple decision trees. Path weighting (PW) reduces these contradictions by prioritizing routes that are already set up and generates examples with higher confidence. Decision tree subset (DTS) also generates examples where the random forest is less confident and has a much larger distribution. The combination of both strategies shows the most uniformly distributed data samples. This leads to the desired result that the variance in the generated data is as large as possible.
In the next step, we measure the impact on the performance by applying neural random forest imitation for each setting. The results are shown in. Overall, adding PW or DTS increases the accuracy on the test dataset with DTS showing a greater improvement. When combining PW and DTS, the accuracy further increases slightly on average. For MNIST and Caltech101 it decreases marginally. In the following the last setting is used.
The architecture of the network can be selected freely as long as the input and output shape are valid. In the following, a network with $64$ neurons in the first and second hidden layer will be denoted as NN6464. For data sampling, ${c}_{\text{std}}$ is set to $3$ and ${p}_{\text{zero}}$ to the average number of zero values in the training features. Stochastic Gradient Descent with a learning rate of $0.01$ and momentum of $0.9$ is used for training. The batch size is set to $32$ and $100$ steps are performed per epoch. Overall, we train for $100$ epochs and select the model with minimum loss on the trainingvalidation set.
6 Results
We perform several experiments to analyze the capability of neural random forests imitation and compare our method to stateoftheart methods.
6.1 Accuracy
The proposed method generates data from a random forest and trains a neural network that imitates the random forest. The goal is that the neural network approximates the same function as the random forest. This also implies that the network reaches the same accuracy if successful.
We analyze the performance by training a random forest for each dataset and evaluating neural random forest imitation with different network architectures. A variety of network architectures with different depth, number of neurons, and additional layers such as Dropout have been studied. In this paper, we focus on twohiddenlayers with equal numbers of neurons in both layers for clarity. The results are shown in Fig. 3. The red dashed line indicates the accuracy of the random forest. The average accuracy and standard deviation on the test dataset with respect to the network size, i.e. the number of weights, are visualized in blue for different architectures. Most importantly, the analysis shows that the accuracy of the neural networks trained by NRFI reaches the accuracy of the random forest for all datasets. The proposed method for generating labeled data from a random forest by analyzing the decision boundaries can be successfully used for training a network that imitates the random forest. In case of CIFAR100, for example, this is achieved with a twohiddenlayer network with $64$ neurons in both layers. Interestingly, neural random forest imitation exceeds the original accuracy without any additional training data. Each decision tree overfits the training data during training without regularization. In NRFI, the optimization is constrained to find an efficient representation and is able to avoid overfitting due to the limited capacity of the network and onthefly generation of training data. In general, the experiment shows that the accuracy increases with increasing network size and NRFI is robust to different network architectures. NRFI is capable of generating a large variety of unique examples from random forests which have originally been trained on limited amount of data.
6.2 Comparison with StateoftheArt
We compare our method to stateoftheart methods presented by Sethi [Sethi1990RFNN] and Welbl [welbl2014casting, Biau2018] as well as Massiceti et al. [massiceti2017random]. Both methods are direct mappings, i.e. the neural network performs almost the same function as the random forest and the performance does not change. The latter method depends on the depth of the subtrees for splitting the decision trees. We optimize the hyperparameter by evaluating all possible values.
The results are shown in Fig. 1. For each dataset, the size of the generated networks are plotted. Network splitting [massiceti2017random] slightly improves the size of the networks for some datasets compared to Welbl et al. Due to asymmetric decision trees and a large input size, the overhead quickly exceeds the advantages of splitting the decision trees. Overall, both methods are limited to very simple random forests with only a few decision trees and limited depth as the size of the generated networks grow exponentially. NRFI introduces imitation instead of mapping. Our method, in comparison, significantly reduces the size of the generated networks while reaching the same or even improve accuracy as shown in the previous section. On CIFAR10, for example, a network with $32$ neurons in both layers is sufficient and we decrease the size of the network from $27.2$ million to $0.1$ million parameters. As a result, NRFI enables the transformation of very complex classifiers into neural networks.
6.3 Scalability
In this experiment, we analyze the scalability of stateoftheart methods and neural random forest imitation. For that, random forests with limited depth are trained and transformed into neural networks.
The size of the generated networks with respect to the maximum depth of the decision trees is presented in Fig. 4. The blue plot indicates the size of the network using the best performing stateoftheart method. We analyze different architectures and highlight networks with sufficient capacity to achieve the original accuracy of the random forest (orange plot). All experiments are repeated ten times and the average accuracy is shown. It should be noted that the size of the network is plotted in logarithmic scale for better visualization. Stateoftheart methods do not scale due to the exponentially growing number of nodes with increasing depth of the decision trees. Neural random forest imitation is capable of generating very efficient networks even with increasing depth of the decision trees and exponentially growing number of nodes. This results from the underlying strategy of data generation and imitation that has been proposed in NRFI. For some datasets and very small maximum depth, the accuracy cannot be reached most likely due to the mixed class distributions in the leafs.
6.4 Robustness
Random forests randomly select a subset of features and training example for each decision tree. As a result, each random forest is different and the performance varies. To analyze the robustness of neural random forest imitation, we train multiple random forests and transform each random forest using different architectures.
The results on Caltech101 are presented in Fig. 5. The average accuracy and standard deviation is $84.22\%\pm 0.8$. For all generated networks, the accuracy is plotted relative to the accuracy of the random forest with respect to the network size. The thicker plot shows the average performance for each network architecture. The results demonstrate that NRFI robustly imitates different random forests and converges stably with slightly improved accuracy.
7 Conclusion
In this paper, we presented a novel method for transforming random forests into neural networks. Instead of a direct mapping, we introduced a process for generating data from random forests by analyzing the decision boundaries and guided routing of data samples to selected leaf nodes. Additionally, we propose strategies for prioritizing already used routes to minimize conflicts and selecting decision trees subsets to generate a wide range of different examples. Based on the generated data and corresponding labels, a network is trained that imitates the random forest.
Experiments on several realworld benchmark datasets demonstrated that NRFI is capable of learning the decision boundaries very efficient. Compared to stateoftheart methods, our approach significantly reduces the size of the networks (e.g. $0.3$ million instead of $694.5$ million parameters on CIFAR100 or $4.2$ million instead of $4061.7$ million parameters on Caltech101) while achieving the same or even improved accuracy due to better generalization.
Our approach has shown that it scales very well and is able to transform highly complex classifiers. While we focus on image classification in this paper, the method can easily be extended to other tasks. As a result, we enable a large number of possible applications, for instance, initializing neural networks, endtoend finetuning, or training of neural networks with very little labeled data.