Neural Random Forest Imitation

  • 2019-11-25 11:04:30
  • Christoph Reinders, Bodo Rosenhahn
  • 29

Abstract

We present Neural Random Forest Imitation - a novel approach for transformingrandom forests into neural networks. Existing methods produce very inefficientarchitectures and do not scale. In this paper, we introduce a new method forgenerating data from a random forest and learning a neural network thatimitates it. Without any additional training data, this transformation createsvery efficient neural networks that learn the decision boundaries of a randomforest. The generated model is fully differentiable and can be combined withthe feature extraction in a single pipeline enabling further end-to-endprocessing. Experiments on several real-world benchmark datasets demonstrateoutstanding performance in terms of scalability, accuracy, and learning withvery few training examples. Compared to state-of-the-art mappings, wesignificantly reduce the network size while achieving the same or even improvedaccuracy due to better generalization.

 

Quick Read (beta)

Neural Random Forest Imitation

Christoph Reinders and Bodo Rosenhahn
Leibniz University Hannover, Germany
{reinders, rosenhahn}@tnt.uni-hannover.de
Abstract

We present Neural Random Forest Imitation – a novel approach for transforming random forests into neural networks. Existing methods produce very inefficient architectures and do not scale. In this paper, we introduce a new method for generating data from a random forest and learning a neural network that imitates it. Without any additional training data, this transformation creates very efficient neural networks that learn the decision boundaries of a random forest. The generated model is fully differentiable and can be combined with the feature extraction in a single pipeline enabling further end-to-end processing. Experiments on several real-world benchmark datasets demonstrate outstanding performance in terms of scalability, accuracy, and learning with very few training examples. Compared to state-of-the-art mappings, we significantly reduce the network size while achieving the same or even improved accuracy due to better generalization.

1 Introduction

Figure 1: Comparison of state-of-the-art and our proposed method for transforming random forests into neural networks. Neural random forest imitation reduces the size of the network significantly while reaching the same or even improved accuracy due to better generalization. The network size is shown in logarithmic scale.

In recent years, convolutional neural networks have become very popular in various tasks such as image classification [Krizhevsky2012AlexNet, Simonyan15], object detection [liu2016ssd, redmon2018yolov3, ren2015faster], or semantic segmentation [Zhao2017PSPNet, RFB15a]. Researchers have published many datasets such as CIFAR-10 [Krizhevsky2012CIFAR] and ImageNet [imagenet_cvpr09] for training neural networks and put enormous effort to provide labels for each individual image. CIFAR-10 has 60 000 labeled examples and ImageNet over 14 million labeled examples. These datasets are a great basis for training and comparing algorithms on a specific benchmark dataset.

For real-world applications, the dependency on large amounts of labeled data represents a major limitation. Frequently, there is little or even no labeled data for a particular task and hundreds or thousands of examples have to be collected and annotated. This particularly affects new applications and rare categories (e.g rare animals). Transfer learning and regularization methods are usually applied to reduce overfitting. However, for learning with little data, the networks still have a huge number of parameters that have to be fine-tuned – even if just the last layers are trained.

The combination of convolutional neural networks brings both worlds together. CNNs have demonstrated great performance in learning expressive feature representations, but require large amount of training data. RFs are very good in learning with very little data without overfitting. The advantages of mapping random forests into neural networks are three-fold: (1) All components can be combined in a single pipeline and existing high-performance deep learning frameworks can be used directly. This accelerates the processing and enables readily GPU processing. (2) The mapping initializes neural networks with very few training examples enabling a warm-start. (3) The resulting network can be further fine-tuned end-to-end jointly optimizing the features as well as the classifier.

State-of-the-art methods for mapping random forest into neural networks [Sethi1990RFNN, welbl2014casting, massiceti2017random], create a two-hidden-layer neural network by adding a neuron for each split node and each leaf node of the decision trees. The size of the networks become very large as the number of nodes grows exponentially with increasing depth of the decision trees. Additionally, many weights are set to zero, so that an inefficient representation is created. Due to both reasons, the mappings do not scale and are only applicable to simple random forests.

In this work, we present a transformation of random forests into neural networks which creates a very efficient neural network. We introduce a method for generating data from a random forest which creates any amount of input data and corresponding labels. With this data, a neural network is trained that learns to imitate the random forest. Experiments demonstrate that the accuracy of the imitating neural network is equal to the original accuracy or even outperforms the original random forest due to better generalization while being significant smaller (see Fig. 1).

To summarize, our contributions are:

  • We propose a novel method for transforming random forests into neural networks by generating data from a random forest and training an imitating neural network. Labeled data samples are created by evaluating the decision boundaries and guided routing to selected leaf nodes.

  • Our transformation is scalable to complex classifiers and deep random forests.

  • We enable learning and initializing neural networks with very little data.

  • Convolutional neural networks and random forests can be combined in an fully differentiable, end-to-end pipeline for acceleration and further fine-tuning.

  • We will publish our source code upon acceptance and provide an easy-to-use toolbox for imitating random forest with neural networks.

2 Related Work

Convolutional Neural Networks

Convolutional neural networks have a long history [SCHMIDHUBER201585] and started with handwriting recognition as one of the first applications [LeCun:1989:BAH:1351079.1351090]. Publicly available image datasets such as ImageNet [imagenet_cvpr09] contributed greatly to the success. Krizhevsky et al. [Krizhevsky2012AlexNet] played an important role proposing a neural network called AlexNet which reduces the top-5 test error on ImageNet from 26.2% to 15.3%. The architecture consists of multiple alternating convolutional layers and activation functions (optionally followed by dropout or pooling layers) to generate features followed by fully-connected layers. In the following years a series of improvements have been proposed. Simonyan and Zisserman [Simonyan15] further decreased the top-5 test error to 7.3% by introducing a deeper but simplified architecture based on 3×3 filter kernels. More complex structure such as GoogLeNet [Szegedy15] use Inception modules for multi-scale processing. ResNet [He2016DeepRL] introduce residual connections for training much deeper neural networks without the problem of vanishing gradients. Huang et al. [Huang_2017_CVPR] further increase the connectivity with densely connected layers.

Transfer Learning

Learning in one domain and transferring the learned knowledge to another domain has become a standard technique. Convolutional neural networks have demonstrated great performance in learning feature representations [Agrawal2014, pmlr-v32-donahue14]. The transition from general to specific convolution layers and their transferability has been studied by Yosinski et al. [Yosinski:2014:TFD:2969033.2969197]. Huh et al. [DBLP:journals/corr/HuhAE16] perform an in-depth analysis of ImageNet features. Li et al. [Li2016] propose a method for fine-tuning without forgetting original learned capabilities.

Combining CNNs and RFs

Random forests [breiman2001random] are an ensemble learning method for classification and regression tasks. Sethi [Sethi1990RFNN] and Welbl [welbl2014casting, Biau2018] presented a technique for mapping random forests to neural networks. The direct mapping creates a two-hidden-layer neural network by adding a neuron in the first hidden layer for every split node and a neuron in the second hidden layer for every leaf node. The weights and biases are set according to the parameters of the decision trees. Massiceti et al. [massiceti2017random] extended this approach and introduced a network splitting strategy by dividing each decision tree into multiple subtrees of depth r~.

These techniques, however, are only applicable for trees of limited depth. As the number of nodes grows exponentially with increasing depth of the trees, a very inefficient representation is created causing a very high memory consumption. In this paper, we address this issue by proposing a much more efficient transformation.

3 Background and Notation

Random forests [breiman2001random] are an ensemble method consisting of multiple decision trees. In this section, we briefly describe decision trees, random forests and the notation used throughout this paper.

Decision Trees

Decision trees consist of split nodes NSplit and leaf nodes NLeaf. Each split node sNSplit performs a split decision and routes a data sample x to the left or right child node, denoted as cleft and cright(s), respectively. When using axis-aligned split decisions, a single feature f(s) and a threshold θ(s) are the bases for the split. If the value of feature f(s) is smaller than θ(s), the data sample is routed to the left child node and otherwise to the right child node, denoted as

xcleft(s) xf(s)<θ(s) (1)
xcright(s) xf(s)θ(s). (2)

Data samples are routed through a decision tree until a leaf node lNLeaf is reached which store the target value. For the classification task, these are the class probabilities Pleaf(l)=(p1l,,pCl), where C is the number of classes.

Random Forests

A single decision tree is very fast and operates on high-dimensional data, however, tends to overfit the training data by constructing a deep tree which perfectly separates all training examples. While having a very small training error, this easily results in a large test error.

Random forests address this problem by learning an ensemble of nT decision trees. Each tree is trained with a random subset of training examples and features. The prediction RF(x) for a random forest is calculated by averaging the predictions of all decision trees.

4 Neural Random Forest Imitation

In this section, a novel approach called Neural Random Forest Imitation (NRFI) for transforming random forests into neural networks is presented. The main concepts include (1) generating training data from decision trees and random forests, (2) adding strategies for reducing conflicts and increasing the variety of the generated examples, and (3) training a neural network that imitates the random forest by learning the decision boundaries. As a result, NRFI enables the transformation of random forest into very efficient neural networks.

4.1 Data Generation

First, we propose a method for generating data from a random forest. A data sample xN is a N-dimensional vector where N is the number of features. We select a target class t[1,,C] and adjust the data sample so that it is classified as such. For that, the decision boundaries from the root nodes to the leaf nodes are evaluated and the data is modified so that it is routed to selected leaf nodes.

4.1.1 Data Initialization

A data sample x is initialized randomly. Let fmean,fstdN be the mean and standard deviation per feature. We sample x𝒩(fmean,cstdfstd), where the standard deviation is multiplied by a factor cstd1 to increase the distribution of the data samples. The data is clipped to the feature minimum fmin and maximum fmax.

In the next step, elements of x are randomly set to zero, each with a probability of pzero. This strategy is especially designed for features that are extract from activation functions like ReLU where a larger number of elements is zero.

4.1.2 Data Generation from Decision Tree

A decision tree processes an input vector x by routing the data through the tree until a leaf is reached. At each node, a split decision is evaluated and the input is passed to the left child node or to the right child node. Finally, a leaf l is reached which stores the probabilities pl=(p1l,,pCl) for each class.

We reverse this process and start with a method for generating training data from a decision tree. The pseudo code is presented in Algorithm 4.1.2. First, the class weights W(n)=(w1n,,wCn) for every node n are defined:

W(n)={Pleaf(n)nNLeafW(cleft(n))+W(cright(n))nNSplit (3)

For every leaf node, the class weights are equal to the stored probabilities in the leaf. For every split node, the class weights in the child nodes are summed up.

The aim of the algorithm is to generate a data sample for target class t. For that, characteristics of the target class are successively added to the data sample. Starting at the root node, we modify the input data is so that it is routed though selected split nodes until a leaf node is reached. For learning the decision boundaries, a new value is sampled around the threshold θ(n) of the current node n with standard deviation fstd,f(n) when a feature is used for the first time:

xf(n)𝒩(θ(n),fstd,f(n)). (4)

The routing is guided based on the weights for the target class in the left child node wleft=wtcleft(n) and right child node wright=wtcright(n). The weights are normalized using L2-norm denoted as p^left and p^right. Afterwards, the left or right child node is randomly selected with p^left and p^right, respectively, as the next node nnext.

In the next step, we verify that the data sample is routed to the selected child node by evaluating the split decision. A split node s routes the data to the left or right child node based on a split feature f(s) and a threshold θ(s). If the value of the split feature xf(s) is smaller than θ(s), the data sample is routed to the left child node and otherwise to the right child node. The data sample is modified if it is not already routed to the selected child node by assigning a new value. If the selected child node is the left child node, the value has to be smaller than the threshold θ(s) and a new value within the minimum feature value fmin,f(n) and θ(s) is randomly sampled:

xf(n)U(fmin,f(n),θ(n)). (5)

If the data sample is supposed to be routed to the right child node, the new value is randomly sampled between θ(s) and the maximum feature value fmax,f(n):

xf(n)U(θ(n),fmax,f(n)). (6)

This process is repeated until a leaf node is reached. In each node characteristics are added that classify the data sample as the target class.

During this process, modifications can conflict with previous decisions because features are used multiple times within a decision tree or across multiple decision trees. We present a strategy for minimizing conflicting routings denoted as path weighting (PW). The current routing – either wleft or wright – is weighted with a factor wpath1 to prioritize the path and do not change the data sample if possible. This effect is particularly interesting when random forests are processed in the next section.

Overall, the presented method enables the generation of data samples and corresponding labels from a decision tree without adding any further data.

{algorithm}

[t] DataGenerationFromTree
Generate data samples from a decision tree
Input: Data sample x, target class t, as well as feature
minimum fmin, maximum fmax, and standard deviation fstd
Output: Data sample for target class t {algorithmic}[1] \Staten root node \WhilenNLeaf \Iffeature f(n) is used for the first time \Statexf(n)𝒩(θ(n),fstd,f(n)) \EndIf\Statewleftwtcleft(n) \Statewrightwtcright(n) \Iffeature f(n) is already used \Stateweight current route with wpath \EndIf\Statep^left,p^right normalize wleft and wright \Statennext randomly select left or right child node with probability of p^left and p^right, respectively \Ifnnext=cleft(n) \Ifxf(n)θ(n) \Statexf(n)U(fmin,f(n),θ(n)) \EndIf\Else \Ifxf(n)<θ(n) \Statexf(n)U(θ(n),fmax,f(n)) \EndIf\EndIf\Statemark feature f(n) as used \Statennnext \EndWhile\State\Returnx

4.1.3 Data Generation from Random Forest

In the next step, we extend the method to generate data from random forests. Random forests consists of nT decision trees RF={T1,,TnT}. For generating a data sample x, the presented method for a single decision tree is applied to multiple decision trees consecutively. In each decision tree, the data sample is modified and routed to selected nodes based on the target class t. When using all decision trees, data samples are created where all trees agree with a high probability.

We introduce a strategy called decision tree subset (DTS) for generating examples with varying confidence, i.e. the predictions of the individual decision trees diverge. To achieve this, a factor pforest[0,1] is defined determining the size of a subset of decision trees nsub=nTpforest. A subset of nsub decision trees is randomly selected:

RFsubRF with |RFsub|=nsub. (7)

All decision trees in RFsub are processed in random order to generate a data sample. For each decision tree the presented method modifies the data sample based on the target class. Finally, the output of the random forest y=RF(x) is predicted. In most cases, the prediction matches the target class. However, due to multiple factors such as the stochastic process, a small subset size, or varying predictions of the decision trees, it can also be different. As a result, an input-target pair (x,y) has been created showing similar characteristics as the target class and any amount of data can be generated by repeating this process.

4.2 Imitation Learning

Finally, a neural network that imitates the random forest is trained. The network learns the decision boundaries from the generated data and approximates the same function as the random forest. The network architecture is based on a fully-connected network with one or multiple hidden layers. The dimension of the input and output data is the same as of the random forest, i.e. a N-dimensional input and C-dimensional output. Each hidden layer is followed by a ReLU activation function [Nair:2010:ReLU]. The last fully connected layer is using a softmax activation function.

For training, we generate input-target pairs (x,y) as described in the last section. Additionally, the data samples x are normalized by subtracting the mean fmean and subsequently dividing by the absolute maximum fabsmax per feature. These training examples are feed into the training process to teach the network predicting the same results as the random forest. To avoid overfitting, the data is generated on-the-fly so that each training example is unique. In this way, we learn an efficient representation of the decision boundaries and are able to transform random forest into neural networks.

5 Experiments

In this section, we present the experimental setup including the datasets, feature learning, training of the random forests, and neural random forest imitation. All experiments are performed on a NVIDIA GeForce GTX 1080 Ti using Tensorflow [tensorflow2015-whitepaper] and Keras [chollet2015keras].

5.1 Datasets

We evaluate the proposed method on five standard benchmark datasets. These datasets cover a wide range of different image classification tasks. MNIST [lecun-mnisthandwrittendigit-2010] is a dataset of handwritten digits. CIFAR-10 and CIFAR-100 [Krizhevsky2012CIFAR] contain natural RGB images of size 32×32 over 10 classes such as airplane, cat, or ship. GTSRB [Stallkamp2012] is a dataset for traffic sign classification. Caltech101 [Fei-Fei:2007] consists of images in 101 categories and an additional background category.

Following [Yosinski:2014:TFD:2969033.2969197], each dataset is divided into two non-overlapping subsets, denoted as DS and DT. Both subsets contain half of the classes and approximately half of the data. Each subset is split into 60% for training, 20% for validation, and 20% for testing, if not already provided.

Finally, the number of training and validation examples of the target dataset is limited to nlimit examples per class. In our experiments we set nlimit to 20. As a result, two datasets for transfer learning are created: A source dataset DS which has large amount of labeled data and is used for feature learning and a target dataset DT which has very few labeled training examples.

5.2 Feature Learning

For feature learning, we train a ResNet-50 [He2016DeepRL] pretrained on ImageNet [imagenet_cvpr09]. The input images are resized to 32×32 for MNIST, CIFAR-10, and CIFAR-100, 48×48 for GTSRB, and 128×128 for Caltech101. It should be noted that the images from MNIST are slightly upscaled because the minimum input size of the feature network is 32×32.

Stochastic Gradient Descent is used for optimizing the network with learning rate of 0.001, momentum of 0.9, and weight decay of 10-6. The batch size is set to 32 and the input data is normalized to [-0.5,0.5]. Augmentation is used by randomly applying rotation, shifting, shearing, zooming, and flipping (except for MNIST and GTSRB). The network is trained for 200 epochs. After each epoch, we evaluate the performance on the validation set and select the model with minimum loss.

5.3 Feature Extraction

Based on the learned features, the network is used to extract a feature representation for every example from the target dataset DT. To determine the best layer for extracting the features for each dataset, the performance on the validation dataset over all activation layers is evaluated. For each activation layer, we extract features, train a random forest with 100 decision trees on the training set, and evaluate the performance on the validation set. This process is repeated ten times and the results are averaged.

Afterwards, the features for the training, validation, and test set are generated. For the training examples, five different random augmentations are applied to the input image (same methods as for feature learning). At the end, the minimum fmin, maximum fmax, absolute maximum fabsmax, mean fmean, and standard deviation fstd are calculated per feature on the extracted training features.

5.4 Random Forest

Random forests are capable of learning with very little labeled data. We train a random forest with nT decision trees on the extracted training features of the target dataset DT where only limited amount of data is available.

The number of decision trees is determined similar as before by training random forests with different number of decision trees nT{5,10,,100} and evaluating the performance on the validation set. The experiment is repeated ten times. For each dataset, the best number of decision trees is selected based on the average validation accuracy.

5.5 Neural Random Forest Imitation

Figure 2: Probability distribution of the predicted confidences for different data generation settings. The combination of path weighting (PW) and decision tree subset (DTS) generates the most evenly distributed data.
Method MNIST [%] CIFAR-10 [%] CIFAR-100 [%] GTSRB [%] Caltech101 [%]
RDG 98.86±0.19 69.69±0.63 33.61±0.95 89.66±0.82 78.63±0.91
RDG + PW 98.72±0.25 69.90±1.05 34.22±0.67 89.38±0.92 78.88±0.92
RDG + DTS 99.14±0.06 70.85±0.48 41.00±0.35 92.27±0.31 84.89±0.81
RDG + PW + DTS 99.13±0.06 71.43±0.58 41.24±0.45 92.29±0.33 84.88±0.82
Table 1: Evaluation of raw data generation (RDG), path weighting (PW), decision treee subset (DTS), and combined strategies on different datasets. For each dataset and method, the accuracy on the test dataset and standard deviation is shown.

Before analyzing the performance of neural random forest imitation, we present an ablation study for two strategies that have been presented: path weighting (PW) and decision tree subset (DTS). The factor pforest sets the size of the subset of all decision trees and weight wpath prioritizes the current routing by weighting the corresponding decision. Please refer to Section 4.1 for details. For generating a wide variety of data, we sample the corresponding values for each data sample individually:

wpath 1+|𝒩(0,5)| (8)
pforest U(0,1). (9)

The parameters have been determined empirically.

In the following experiment, we evaluate the effects of the path weighting and decision tree subset strategy. For each setting, 1000 examples are generated and the predicted confidences of the ground truth class by the random forest are analyzed. The distribution of the confidences is visualized in Fig. 2 for CIFAR-10. The raw data generation (RDG) without path weighting and decision tree subset generates a limited variety of data which are all predicted with a confidence of around 0.8. It is also conspicuous that no data samples with very high confidence are created due to contradictions of multiple decision trees. Path weighting (PW) reduces these contradictions by prioritizing routes that are already set up and generates examples with higher confidence. Decision tree subset (DTS) also generates examples where the random forest is less confident and has a much larger distribution. The combination of both strategies shows the most uniformly distributed data samples. This leads to the desired result that the variance in the generated data is as large as possible.

In the next step, we measure the impact on the performance by applying neural random forest imitation for each setting. The results are shown in. Overall, adding PW or DTS increases the accuracy on the test dataset with DTS showing a greater improvement. When combining PW and DTS, the accuracy further increases slightly on average. For MNIST and Caltech101 it decreases marginally. In the following the last setting is used.

The architecture of the network can be selected freely as long as the input and output shape are valid. In the following, a network with 64 neurons in the first and second hidden layer will be denoted as NN-64-64. For data sampling, cstd is set to 3 and pzero to the average number of zero values in the training features. Stochastic Gradient Descent with a learning rate of 0.01 and momentum of 0.9 is used for training. The batch size is set to 32 and 100 steps are performed per epoch. Overall, we train for 100 epochs and select the model with minimum loss on the training-validation set.

6 Results

We perform several experiments to analyze the capability of neural random forests imitation and compare our method to state-of-the-art methods.

6.1 Accuracy

(a) MNIST
(b) CIFAR-10
(c) CIFAR-100
(d) GTSRB
(e) Caltech101
Figure 3: Accuracy on the test dataset with respect to the network size. The red dashed line indicates the accuracy of the random forest. With increasing network capacity, NRFI is capable of imitating and even outperforming the random forest.

The proposed method generates data from a random forest and trains a neural network that imitates the random forest. The goal is that the neural network approximates the same function as the random forest. This also implies that the network reaches the same accuracy if successful.

We analyze the performance by training a random forest for each dataset and evaluating neural random forest imitation with different network architectures. A variety of network architectures with different depth, number of neurons, and additional layers such as Dropout have been studied. In this paper, we focus on two-hidden-layers with equal numbers of neurons in both layers for clarity. The results are shown in Fig. 3. The red dashed line indicates the accuracy of the random forest. The average accuracy and standard deviation on the test dataset with respect to the network size, i.e. the number of weights, are visualized in blue for different architectures. Most importantly, the analysis shows that the accuracy of the neural networks trained by NRFI reaches the accuracy of the random forest for all datasets. The proposed method for generating labeled data from a random forest by analyzing the decision boundaries can be successfully used for training a network that imitates the random forest. In case of CIFAR-100, for example, this is achieved with a two-hidden-layer network with 64 neurons in both layers. Interestingly, neural random forest imitation exceeds the original accuracy without any additional training data. Each decision tree overfits the training data during training without regularization. In NRFI, the optimization is constrained to find an efficient representation and is able to avoid overfitting due to the limited capacity of the network and on-the-fly generation of training data. In general, the experiment shows that the accuracy increases with increasing network size and NRFI is robust to different network architectures. NRFI is capable of generating a large variety of unique examples from random forests which have originally been trained on limited amount of data.

6.2 Comparison with State-of-the-Art

We compare our method to state-of-the-art methods presented by Sethi [Sethi1990RFNN] and Welbl [welbl2014casting, Biau2018] as well as Massiceti et al. [massiceti2017random]. Both methods are direct mappings, i.e. the neural network performs almost the same function as the random forest and the performance does not change. The latter method depends on the depth of the subtrees for splitting the decision trees. We optimize the hyperparameter by evaluating all possible values.

The results are shown in Fig. 1. For each dataset, the size of the generated networks are plotted. Network splitting [massiceti2017random] slightly improves the size of the networks for some datasets compared to Welbl et al. Due to asymmetric decision trees and a large input size, the overhead quickly exceeds the advantages of splitting the decision trees. Overall, both methods are limited to very simple random forests with only a few decision trees and limited depth as the size of the generated networks grow exponentially. NRFI introduces imitation instead of mapping. Our method, in comparison, significantly reduces the size of the generated networks while reaching the same or even improve accuracy as shown in the previous section. On CIFAR-10, for example, a network with 32 neurons in both layers is sufficient and we decrease the size of the network from 27.2 million to 0.1 million parameters. As a result, NRFI enables the transformation of very complex classifiers into neural networks.

6.3 Scalability

(a) MNIST
(b) CIFAR-10
(c) CIFAR-100
(d) GTSRB
(e) Caltech101
Figure 4: Analyzing the scalability of current state-of-the-art methods (blue) and neural random forest imitation (orange). For each network, the accuracy is shown and filled points indicate an accuracy greater or equal than the original accuracy.

In this experiment, we analyze the scalability of state-of-the-art methods and neural random forest imitation. For that, random forests with limited depth are trained and transformed into neural networks.

The size of the generated networks with respect to the maximum depth of the decision trees is presented in Fig. 4. The blue plot indicates the size of the network using the best performing state-of-the-art method. We analyze different architectures and highlight networks with sufficient capacity to achieve the original accuracy of the random forest (orange plot). All experiments are repeated ten times and the average accuracy is shown. It should be noted that the size of the network is plotted in logarithmic scale for better visualization. State-of-the-art methods do not scale due to the exponentially growing number of nodes with increasing depth of the decision trees. Neural random forest imitation is capable of generating very efficient networks even with increasing depth of the decision trees and exponentially growing number of nodes. This results from the underlying strategy of data generation and imitation that has been proposed in NRFI. For some datasets and very small maximum depth, the accuracy cannot be reached most likely due to the mixed class distributions in the leafs.

6.4 Robustness

Figure 5: Transformation of different random forest using neural random forest imitation to analyze the robustness.

Random forests randomly select a subset of features and training example for each decision tree. As a result, each random forest is different and the performance varies. To analyze the robustness of neural random forest imitation, we train multiple random forests and transform each random forest using different architectures.

The results on Caltech101 are presented in Fig. 5. The average accuracy and standard deviation is 84.22%±0.8. For all generated networks, the accuracy is plotted relative to the accuracy of the random forest with respect to the network size. The thicker plot shows the average performance for each network architecture. The results demonstrate that NRFI robustly imitates different random forests and converges stably with slightly improved accuracy.

7 Conclusion

In this paper, we presented a novel method for transforming random forests into neural networks. Instead of a direct mapping, we introduced a process for generating data from random forests by analyzing the decision boundaries and guided routing of data samples to selected leaf nodes. Additionally, we propose strategies for prioritizing already used routes to minimize conflicts and selecting decision trees subsets to generate a wide range of different examples. Based on the generated data and corresponding labels, a network is trained that imitates the random forest.

Experiments on several real-world benchmark datasets demonstrated that NRFI is capable of learning the decision boundaries very efficient. Compared to state-of-the-art methods, our approach significantly reduces the size of the networks (e.g. 0.3 million instead of 694.5 million parameters on CIFAR-100 or 4.2 million instead of 4061.7 million parameters on Caltech101) while achieving the same or even improved accuracy due to better generalization.

Our approach has shown that it scales very well and is able to transform highly complex classifiers. While we focus on image classification in this paper, the method can easily be extended to other tasks. As a result, we enable a large number of possible applications, for instance, initializing neural networks, end-to-end finetuning, or training of neural networks with very little labeled data.

References