ReinBo: Machine Learning pipeline search and configuration with Bayesian Optimization embedded Reinforcement Learning

  • 2019-04-10 18:26:16
  • Xudong Sun, Jiali Lin, Bernd Bischl
  • 12


Machine learning pipeline potentially consists of several stages ofoperations like data preprocessing, feature engineering and machine learningmodel training. Each operation has a set of hyper-parameters, which can becomeirrelevant for the pipeline when the operation is not selected. This gives riseto a hierarchical conditional hyper-parameter space. To optimize this mixedcontinuous and discrete conditional hierarchical hyper-parameter space, wepropose an efficient pipeline search and configuration algorithm which combinesthe power of Reinforcement Learning and Bayesian Optimization. Empiricalresults show that our method performs favorably compared to state of the artmethods like Auto-sklearn , TPOT, Tree Parzen Window, and Random Search.


Quick Read (beta)

ReinBo: Machine Learning pipeline search and configuration with Bayesian Optimization embedded Reinforcement Learning

Xudong Sun equal contribution.LMU Munich    Jiali Lin* LMU Munich    Bernd Bischl LMU Munich

Machine learning pipeline potentially consists of several stages of operations like data preprocessing, feature engineering and machine learning model training. Each operation has a set of hyper-parameters, which can become irrelevant for the pipeline when the operation is not selected. This gives rise to a hierarchical conditional hyper-parameter space. To optimize this mixed continuous and discrete conditional hierarchical hyper-parameter space, we propose an efficient pipeline search and configuration algorithm which combines the power of Reinforcement Learning and Bayesian Optimization. Empirical results show that our method performs favorably compared to state of the art methods like Auto-sklearn , TPOT, Tree Parzen Window, and Random Search.

Bayesian Optimization Reinforcement Learning AutoML

shapes,backgrounds \usetikzlibraryarrows,positioning \usetikzlibrarytrees \usetikzlibrarychains \usetikzlibraryarrows,petri,topaths\usetikzlibraryshadows

1 Introduction

Over the past years, Machine Learning (ML) has achieved remarkable success in a wide range of application areas, which has greatly increased the demand for machine learning systems. However, an efficient machine learning algorithm crucially depends on a human expert, who has to carefully design the pipeline of the machine learning system, potentially consisting of data pre-process, feature filtering, machine learning algorithm selection, as well as identifying a good set of hyper-parameters. As there are a large number of possible alternatives of models as well as hyper-parameters, the need for automated machine learning (AutoML) has been growing, which is supposed to automatically generate a data analysis pipeline with machine learning methods and parameter settings that are optimized for a given data set, in order to make machine learning methods available for non-expert users.

Since hyper-parameters of a machine learning model have a large influence on the performance of the model, hyper-parameter optimization becomes a critical part of an AutoML system. Popular hyper-parameter optimization methods include Sequential Bayesian Optimization, which iterates between fitting surrogate models that predict model performance, and using them to make choices about which configurations to investigate.

However, the composition of the machine learning pipelines also plays a vital role in the performance of AutoML systems. Choosing different data preprocessing or feature engineering techniques as well as choosing different machine learning models for a specific dataset could potentially result in considerable performance differences. The joint optimization of the pipeline search and its associated hyper-parameters configuration could essentially reside under the umbrella of Combined Algorithm Selection and Hyperparameter optimization (CASH) problem [30], where Algorithm corresponds to the pipeline and Configuration corresponds to the hyper-parameters associated with the pipeline. The pipelines and hyper-parameters reside in a conditional hierarchical space, where some hyper-parameters only become valid when the corresponding pipeline is present. For example, Figure 1 illustrates such a situation when the data preprocessing and feature engineering operations are selected, which correspond to an incomplete pipeline, one out of three machine learning algorithms need to be chosen (indicated by dashed edges) to complete the pipeline, the corresponding hyper-parameters (indicated by solid edges) of an algorithm only become valid when the algorithm is selected.


[sibling distance=10em, every node/.style = shape=rectangle, rounded corners, draw, align=center, top color=white, bottom color=blue!20]]


Data Preprocessing and Feature Engineering Operations selected child[dashed] node xgboost child[solid]node max depth and … child[dashed] node Random Forest child[solid]node mtry and … child[dashed] node Support Vector Machine child[solid] node σ child[solid] node C ;

Figure 1: Example of Conditional Hierarchical Space

To optimize the conditional hyper-parameters space jointly with the pipeline it is attached to, we embed Bayesian Optimization in the Reinforcement Learning process, and dub the method ReinBo, which means Machine Learning Pipeline search and configuration with Reinforcement Learning and Bayesian Optimization. Note that ReinBo can solve not only CASH problems, but also any mixed discrete and continuous conditional hierarchical space optimization, which is left for future work.

Our major contributions are:

  • Inspired by Hierarchical Reinforcement Learning [13], we transform the conditional hierarchical hyper-parameter optimization problem into subtasks of pipeline selection and hyper-parameter optimization, which circumvents the conditional constraint and reduces the search dimension.

  • To our best knowledge, we are the first to embed Bayesian Optimization (BO) into Reinforcement learning, specifically Q Learning [31] for collaborative joint search of pipelines and hyper-parameters, which is different from using BO for policy optimization [12], and also different from using BO for hyper-parameter fine tuning after an optimal pipeline is selected by a reinforcement learning based AutoML framework [32].

  • We provide an open source light weight R language implementation reinbo11 1 for the R Machine Learning community which could run efficiently on a personal computer, and takes much less resources compared to other AutoML softwares.

In the following section, we review related works and discuss the differences to our method. In section 3, we explain our method in detail and also shed light to connections with Hyperband [21]. In section 4, we benchmark our method by comparing it with several state of art methods.

2 Related Work

In this section, we try to classify the current popular AutoML solutions into a taxonomy and discuss the differences of each individual work with ours.

Sequential Model Based Optimization family  Auto-sklearn [15] and Auto-Weka [30] both use Sequential Model-based Algorithm Configuration (SMAC) [17] to solve the Combined Algorithm Selection and Hyperparameter optimization (CASH) problem. SMAC[17] transforms the conditional hierarchical hyper-parameter space into a flat structure by instantiating inactive conditional parameters to default values, which allows the random forest to focus on active hyper-parameters [17]. A potential drawback for this method is that the surrogate model needs to learn in a high dimensional hyper-parameter space, which might need a large sample of observations to be sufficiently trained, while in practice, running machine learning algorithm is usually very expensive. Tree Parzen Window (TPE) [7], however, tackles the conditional hierarchical hyper-parameter space using a tree like Parzen Window to construct two density estimators on top of a tree like hyper-parameter set. Expected improvement induced from lower and upper quantile density estimators is used to select new candidate proposals from points generated by Ancestral Sampling.

Tree-based Genetic Programming  TPOT [24] automatically designs and optimizes machine learning pipelines with a genetic programming [3] algorithm. The machine learning operators are used as genetic programming primitives, which will be combined by tree-based pipelines and the Genetic Programming algorithm is used to evolve tree-based pipelines until the “best” pipeline is found. Similar methods also include Recipe [26]. However, this family of methods does not scale well [23]. In this paper, we aim for an AutoML system that could give a valuable configured pipeline within limited time.

Monte Carlo Tree Search Alike  ML-Plan [23] is an AutoML system, built upon a Hierarchical Task Network, which uses a Monte Carlo Tree Search like algorithm to search for pipelines and also configure the pipeline with hyper-parameters. Task is expanded based on best-first search, where the score is estimated by a randomized depth first search by randomly trying different subtree possibilities on a Hierarchical Task Network. To ensure exploration, ML-Plan gives equal possibility to the starting node in a Hierarchical Task Network and then uses a best-first strategy for searching at the lower part of the network. Potential drawback for this method is that the hyper-parameter space is discretized, which might essentially lose good candidates in continuous spaces since large continuous hyper-parameter spaces would be essentially hard to discretize.

Reinforcement Learning based Neural Network Architecture Search  This family of methods are usually not termed as AutoML systems but rather Neural Architecture Search. For instance, MetaQNN [2] uses Q-learning to search convolutional neural network architectures. The learning agent is trained to sequentially choose CNN layers using Q-learning with an ϵ-greedy exploration strategy and the goal is to maximize the cross-validation accuracy. In [34], instead of using Q-learning, the authors use Recurrent Neural Networks as the reinforcement learning policy approximator to generate variable strings to represent various neural architecture forms. The reward-function is designed to be the validation performance of the constructed network. The reinforcement learning policy is trained with gradient descent algorithm, specifically REINFORCE. The architecture elements being searched are very similar to MetaQNN. Inspired from [34], we also assume the machine learning pipeline to be optimized could be represented by a variable length string, but our work is different from [34] in that we use both Deep Q-learning and Tabular Q-learning. More importantly, compared with both [2] and [34], which optimize the neural architecture, the elements of the architecture are mostly factor variables like layer type or discretized elements like filter size, while in this paper, we deal heavily with continuous hyper-parameters (The C and σ for a rbf kernel Support Vector Machine). To jointly optimize the discrete pipeline choice and associated continuous hyper-parameters, we embed Bayesian Optimization inside our reinforcement learning agent.

Other Reinforcement Learning based methods  In [32], the authors also combine pipeline search and hyper-parameter optimization in a reinforcement learning process based on the PEORL [33] framework, however, the hyper-parameter is randomly sampled during the reinforcement learning process, an extra stage is needed to sweep the hyper-parameters using hyper-parameter optimization techniques, while in our work, hyper-parameter optimization is embedded in the reinforcement learning process. Alpha3M [14] combined MCTS and recurrent neural network in a self play [27] fashion, however, it seems that Alpha3M does not perform better than the state of art AutoML systems. For example, out of all the 6 OpenML datasets they have used to compare with state of art AutoML systems, Alpha3M only shows a clear improvement on 1 dataset (spectf) against Auto-sklearn [15] and TPOT [24], according to Figure 4 in [14]. Furthermore, it is not clear to us how the hyper-parameters are set and if Bayesian Optimization is used. The implementation of Alpha3M takes advantage of the GPUs [14] for the fast performance while our method has a light weight implementation which efficiently runs with CPU and does not necessarily need Neural Networks.

3 Method

3.1 Towards Reinbo


[ scale=0.65, start chain=1 going below, start chain=2 going right, node distance=1mm, desc/.style= scale=0.75, on chain=2, rectangle, rounded corners, draw=black, very thick, text centered, text width=8cm, minimum height=12mm, fill=blue!30 , it/.style= fill=blue!10 , level/.style= scale=0.75, on chain=1, minimum height=12mm, text width=2cm, text centered , every node/.style=font= ]


[level] (Level 3) s1:DataProcess; \node[level](dphp) Data Process HyperPars; \node[level] (Level 2) s2:FeatureEngeering; \node[level] (Level 1.5) FeatureEngineering HyperPar; \node[level] (Level 1) s3:Learner; \node[level] (Level 0) LearnerHyperParameter;


(Level 3); \node[desc, it](impute)Imputation; \node[desc](dpna)NA; \chainin(impute); \node[desc, it, text width=0cm, continue chain=going below] (fehp1) ; \chainin(fehp1); \node[desc,it, continue chain=going below](pca)PCA; \node[desc, continue chain = going right] (anova) Anova; \chainin(pca); \node[desc, it, text width=0cm, continue chain=going below] (fehp2) ; \node[desc] (svm) SVM; \node[desc, it, continue chain=going right] (rf) Random Forest; \node[desc, it, text width=0cm, continue chain=going below] (fehp2) ; \chainin(svm); \node[desc, text width=2cm, xshift=0, continue chain = going below] (sigma) sigma; \node[desc, text width=2cm, xshift=0, continue chain = going right] (c) C; \chainin(anova); \node[desc, text width=3.5cm, xshift=2.25cm, continue chain = going below] (PLC) percentage;

Figure 2: Toy example of selected pipeline and associated hyperparameters

As shown in Figure 2, we assume that a machine learning pipeline potentially consists of 3 stages (s1 through s3 in the figure), which include data preprocessing (imputations, NA and more), feature engineering (Principal Component Analysis for feature transform, Anova for feature filtering and more), and machine learning model (learner like SVM, Random Forest). Specifically, we use operation “NA” to indicate that no operation would be done in the stage in question. Figure 2 just serves as a toy but working example for ReinBo, in practice, there are a lot more operations available. A particular operation has associated hyper-parameters (for instance the percentage of selected features in Anova feature filtering). In Figure 2, dark color filled cells (NA, Anova, SVM) represent selected operations and their associated active hyper-parameters (percentage, sigma, C), while hyper-parameters for inactive operations are not drawn in the figure.

Observing from Figure 2, along with Figure 1, we could think of the pipeline selection and configuration problem as a two-phase process. During the first phase, a planning algorithm guides the agent to choose a path which corresponds to an unconfigured pipeline. This is similar to a multi-armed bandit problem, where each path corresponds to one arm, while difference lies in that the agent can not directly pull a discrete arm but have to pull across several consecutive discrete arm groups (each arm group corresponds to a stage in Figure 2) and the agent only gets reward after choosing one of arms from the last group. The second phase is similar to contextual bandit with continuous action space (corresponding to hyper-parameters), where the context is which path from the first phase has been selected.

We model the first phase as a reinforcement learning episode, where a particular operation in stage i is treated as action ai, taken upon corresponding state si. State si could be represented by actions taken up to the current stage for example. The pipeline search problem is then to find an optimal policy π to decide which operation (action) to take at a particular state. The action value function Q(s,a) at each state tells us how favorable a particular operation is. We use 𝒜si to denote the space of legal actions at state si. Suppose a roll-out of states trajectory for one composition (episode) is s1,,sK, the corresponding space of pipeline could be denoted by i=1K𝒜si, where K is the total number of stages and we use to denote the Cartesian Product. For a more general notation, we use 𝒜(Si) to denote the space of actions when the state is Si at stage i.

We search for potentially better hyperparameters in the second phase with Bayesian Optimization. Aside from the pipeline itself, each concrete operation (action ai) at stage i is configurable by a set of hyper-parameters Φai. Φai can be hyper-parameters set for a preprocessor like the ratio of variance to keep in PCA or hyper-parameters set for a machine learning model like the C and σ hyper-parameter for SVM. Thus a configured pipeline search space would be i=1K𝒜(Si;Φai) where we use Φai to denote the conditional hyper-parameter space at stage i.

The connection point between reinforcement learning and Bayesian Optimization lies in the reward function design in the reinforcement learning part. During the composition process, there is no signal available to judge how good a current uncompleted pipeline is until the final learner (classifier) is configured with hyper-parameters and trained on the data. At the starting point, different pipelines are tried out randomly, which corresponds to an untrained exploration policy π. A completed pipeline with a joint non-conditional hyper-parameter search space is optimized with Bayesian Optimization for a few steps. The best negative loss is then used as a reward at the end of an episode to guide the reinforcement learning agent towards a better policy. The environment uncertainty only comes with the stochastic reward, while the transition from current state to next state through action is deterministic. We choose to use Q-learning [31] to optimize the policy where we have tried the Tabular Q-learning and Deep Q-learning [22]. We find out that the Tabular Q-learning works better than Deep Q-learning. For space constraint, the latter is not discussed in detail in this work.

We need Bayesian Optimization to optimize the hyper-parameters in a fine grained level with limited budget, but also want to give budget preference to those promising pipelines. To circumvent the complexity of conditional and hiercharical relationship between hyper-parameters and pipeline, we use reinforcement learning to choose a pipeline and let Bayesian Optimization tune the hyper-parameters. We model the variation of the same pipeline with different hyper-parameters as the environment uncertainty. By using separate surrogate model for each pipeline, we circumvent the risk of mistakenly modeling improper dependent structure between different hyper-parameters, at a minor cost of maintaining those searched pipelines surrogate model as dictionary in memory.

3.2 Connections to Hyperband

The idea of only using a few steps Bayesian Optimization is inspired by Hyperband [21], where the trade-off between aggressively exploring more configurations and giving each configuration more resources to be validated is solved by grid searching. Instead, in this paper, we do not need the grid search, promising pipelines will get a higher probability to be selected by our reinforcement learning agent which means these pipelines get more chances to be evaluated by the Bayesian Optimization process. The trade-off between exploitation and exploration is naturally resolved by an ϵ-greedy policy, and by annealing ϵ from a large value to a small value, we encourage more exploration at the beginning. Compared to Hyperband, our method selects the budgets allocated for a particular pipeline automatically, the effectiveness of our strategy could then rely on the recent success of reinforcement learning in different areas.

3.3 Connection and Extension to Hierarchical Reinforcement Learning

Hierarchical Reinforcement Learning (hrl) [4] is proposed to tackle the curse of dimensionality in Reinforcement Learning [19]. Although the Option approach [4] is more popular, our method has a close connection to the MAXQ subtask approach [13], which divide a task recursively into subtasks and decompose the value function accordingly. The current version of ReinBo can be treated as a special case of the MAXQ task decomposition, where we have two tasks of pipeline selection and hyper-parameter configuration. However, in the current version, most states are not shared between these two tasks, so there is no need to use MAXQ hrl algorithm to solve the problem. But our method can be naturally extended to a hrl version when our design space of pipeline allow shared state between the two subtasks. We leave it as future work to optimize such complicated pipelines using Hierarchical Reinforcement Learning.

3.4 Procedures of ReinBo

As shown in Algorithm 3.4 , we first initialize a policy π for the agent which can be represented by a randomly initialized neural network or a Q-table initialized with default values, coupled with an exploration mechanism like the ϵ-greedy strategy. During the roll-out, initial populations of pipelines get sampled, with the corresponding hyper-parameter space Λ(iΦai) to be optimized by Bayesian Optimization for several steps. The corresponding surrogate model is stored in the dictionary for future episode if the same pipeline gets rolled out again. The performance of the pipeline on validation data will be used to serve as feedback signal or reward to the reinforcement learning agent to conduct policy iteration. {algorithm} ML Reinbo \[email protected]@algorithmic \REQUIREdataset 𝒟, Operators and Hyper-parameters
Initialize Policy π
Initialize MBO Surrogate Dictionary
\WHILEBudget not reached \STATERoll-out an unconfigured pipeline ai according to policy π
\STATEGet hyper-parameters set for the ground pipeline Λ(ai)=iΦai
\STATEReward RMBO_PROBE(ai,Λ,) \STATEUpdate Policy π with reinforcement learning algorithm \ENDWHILE

Once an unconfigured pipeline is constructed at the end of the episode, running Bayesian Optimization could be beneficial in searching for a more favorable hyper-parameter setting. However, Bayesian Hyperparameter Optimization with large budgets could be rather expensive. Instead, we optimize hyper-parameters for an un-configured pipeline only for several iterations. For example, we take the number of iterations to be 2 or 3 times the dimension of hyper-parameter space, which means that hyper-parameter spaces with higher dimension will get more sampling budgets. After each episode, the current best configuration’s performance for this pipeline in question is used as reward. The next time the same pipeline is sampled, the surrogate model could be retrieved from the dictionary to facilitate further search using MBO. We dub the hyper-parameter search process as MBO_PROBE, with details shown in Algorithm 3.4.22 2 To save budgets, when an un-configured pipeline does not improve after a number of trials of MBO_PROBE, it can also be suspended for future evaluation. If an un-configured pipeline is not sampled yet, an initial design is generated to facilitate an initial surrogate model.


MBO_PROBE(ai,Λ,) \[email protected]@algorithmic \REQUIRE  
\IF{iai}= \STATEgenerate initial design of size ninit for surrogate model;
\FORj in 1:ninit \STATEevaluate each design by initiating the pipeline with corresponding hyper-parameters Λ(ai).
\ENDFOR\STATEinitialize {iai} \ENDIF\FORj in 1:nprobe \STATEpropose new point according to surrogate model {ai} \STATEevaluate new point \ENDFOR


y best accuracy until now

4 Experiments

4.1 Implementation, Comparision Methods and Setups

Our initial implementation for ReinBo is based on R machine learning packages mlr [10], mlrCPO [8] for pipeline construction and mlrMBO [11] for Bayesian Optimization. However, ReinBo could be extended to combine with any machine learning library as backends. The R package parabox33 3 is implemented for this project to specify conditional hierarchical hyper-parameter space and provides the conditional ancestral sampling (random search in conditional hyper-parameter space). The R package rlR44 4 is implemented for reinforcement learning where the user could implement a custom environment as input. All python packages are invoked with the R-Python interface reticulate [1].

To evaluate the performance of our proposed method, we compare the performance of ReinBo with several state of art AutoML systems, as well as several conditional hyper-parameter space tuning methods. We compare against Auto-sklearn [15] and TPOT [24], both based on scikit-learn [25]. At the time of writing, there is no documentation and examples online for ML-Plan [23], thus, ML-Plan is not included in the experiment. Additionally, we compare against hyperparameter optimization techniques which preserve the hierarchical conditional structure, including Tree-structured Parzen Estimator (TPE) [7] used in Hyperopt [6], and Random Search with conditional Ancestral Sampling (self implemented in R package parabox). Random Search remains a very strong baseline in a lot of machine learning hyper-parameter optimization scenarios [5].

Evaluation Criteria  As warned in [23], many state of art AutoML systems seem to have missed to deal with the risk of overfitting. Therefore, in the experiment part, we focus on evaluating the generalization capability of the selected pipeline empirically. To avoid any potential confusion from synonyms, we use Dopt to represent the part of a dataset fed into a given AutoML system and use Dtest to represent the locked out part [28] of the same dataset used to test the generalization capacity. The split of Dopt and Dtest is done by Cross Validation, which means for a dataset D, D=DoptDtest and DoptDtest=. To create the Dopt and Dtest split, we use 5-fold cross-validation (CV5), which corresponds to the outer loop of Nested Cross Validation (NCV). We take the aggregated mmce (mean miss-calssification error) across the 5 iterations of outer loop of NCV as performance measure.

Instead of using running time as budget, we use the number of pipeline configurations as the unit of budget, to account for performances variations due to hardware and network load etc. For Dopt, we assigned a budget of 1000 times of CV5 equivalent to each AutoML algorithm, which corresponds to the inner loop of NCV.

Other Setups  To account for different AutoML systems data input format incompatibility problem, we conduct dummy encoding to categorical features beforehand. Aiming for a light weight implementation, in the experiment, we limit our choice of pipeline components for ReinBo. We compose a pipeline in 3 stages, with potential operations/actions at each stage listed in Table 1. Associated hyper-parameters with an un-configured pipeline are listed in Table 2. We call the components and associated hyper-parameters the pipeline pool. The same pipeline pool is used for ReinBo, TPE and Random Search.

For Auto-sklearn, Meta-learning and ensemble are disabled, the resampling strategy is set to be CV5, stop criteria is changed to budget instead of time and all other configurations are kept default. We have contacted the author of Autosklearn through github for the right use of the API to ensure the above configuration is satisfied. For TPOT (version 0.9), the default configuration space contains a lot of operators while the light version provides only fast models and pre-processors. The light TPOT is therefore less time-consuming but it could probably lead to lower accuracy in consequence. For this reason, we compare ReinBo with both TPOT with the default configuration and TPOT with light configuration, and we call them TPOT and TPOT-light respectively. TPOT is configured to allow equal amount of budgets with all methods being compared and other configurations are left to be default.

Datasets  We experimented on a set of standard benchmarking datasets of high quality collected from OpenML-CC1855 5 [9] and Auto-Weka  [29], which are rather well-curated from many thousands and have diverse numbers of classes, features, observations, as well as various ratios of the minority and majority class size. Summary of these datasets is listed in Table 3.

Table 1: List of pipeline operations. An operation of “NA” here is used to indicate that no operation would be taken in the corresponding stage. Please refer to mlrCPO documentation for the detailed meaning of these operators.
Stage Operation/Action
1 DataPreprocess Scale(default) Scale(center=FALSE) Scale(scale=FALSE) SpatialSign NA
2 Feature Engineering Pca FilterKruskal FilterAnova FilterUnivariate NA
3 Classifier kknn ksvm ranger xgboost naiveBayes
Table 2: List of hyper-parameters to the operations in Table 1. “p” in the column “Range” indicates the number of features of the original dataset. We invite the user to refer to the R packages mlrCPO and mlr documentations for the exact meaning of operation, hyper-parameters, etc.
Operation Parameter Type Range
Anonva,Kruskal,Univariate perc numeric (0.1,1)
Pca rank integer (p/10,p)
kknn k integer (1,20)
ksvm C numeric (2-15,215)
ksvm sigma numeric (2-15,215)
ranger mtry integer (p/10,p/1.5)
ranger sample.fraction numeric (0.1,1)
xgboost eta numeric (0.001,0.3)
xgboost max_depth integer (1,15)
xgboost subsample numeric (0.5,1)
xgboost colsample_bytree numeric (0.5,1)
xgboost min_child_weight numeric (0,50)
naiveBayes laplace numeric (0.01,100)
Table 3: List of OpenML Datasets for experiment. Columns are the OpenML task_id and name, the number of classes (nClass), features (nFeat) and observations (nObs), as well as the ratio of the minority and majority class sizes (rMinMaj).
task_id name nClass nFeat nObs rMinMaj
14 mfeat-fourier 10 77 2000 1.00
23 cmc 3 10 1473 0.53
37 diabetes 2 9 768 0.54
53 vehicle 4 19 846 0.91
3917 kc1 2 22 2109 0.18
9946 wdbc 2 31 569 0.59
9952 phoneme 2 6 5404 0.42
9978 ozone-level-8hr 2 73 2534 0.07
125921 LED-display-domain-7digit 10 8 500 0.65
146817 steel-plates-fault 7 28 1941 0.08
146820 wilt 2 6 4839 0.06

4.2 Experiment results

In Figure 3, we compare the mmce (1-Accuracy) of each method with boxplot over the datasets listed in Table 3 across 10 statistical replications. Additionally, we list numerical results in Table 4 with statistical test, where each numerical value represents the aggregated mean mmce over the statistical replications. Underline in each row indicates the smallest mean value over the corresponding dataset. The bold-faced values indicate that the corresponding algorithm does not perform significantly worse than the underlined algorithm on the corresponding dataset based on Mann-Whitney U test. As shown in Table 4, ML-ReinBo has boldfaces for 8 of 10 datasets followed by much less boldfaces from other methods.

In Table 5, we compare win (significantly better), lose and tie (neither significantly better nor worse) relationships according to the test. As shown in Table 5, ReinBo has won TPOT on 5 datasets and performed worse than TPOT for only one dataset. And not surprisingly, TPOT has performed considerably better than TPOT-light in the empirical experiments since TPOT-light has smaller search space with only fast models and preprocessors. ReinBo also performs much better than Random Search and TPE, where ReinBo has significantly won them on 5 and 6 tasks respectively and lost only on 1 task. Compared to ReinBo, Auto-sklearn has won only once and behaved worse than ReinBo on 3 of 10 datasets.

Table 4: Average performance (mmce) of algorithms across 10 statistical replications with different seeds. In each run the aggregated mmce based over the outer loop of NCV is taken as performance measure for each algorithm. Each value in this table is the mean value of the aggregated mmce values across 10 replications and the best mean value for each dataset is underlined. The bold-faced values indicate that the algorithm does not perform significantly worse than the underlined algorithm on the corresponding dataset based on Mann-Whitney U test.
dataset Auto-sklearn TPE TPOT TPOT-light Random ReinBo Underlined Algorithm
mfeat-fourier 0.1412 0.1542 0.1451 0.1489 0.1580 0.1278 ReinBo
cmc 0.4470 0.4485 0.4457 0.4506 0.4500 0.4485 TPOT
diabetes 0.2483 0.2436 0.2452 0.2413 0.2455 0.2395 ReinBo
vehicle 0.1679 0.2117 0.1784 0.2057 0.2020 0.1621 ReinBo
kc1 0.1421 0.1351 0.1380 0.1438 0.1353 0.1387 TPE
wdbc 0.0299 0.0348 0.0353 0.0264 0.0341 0.0271 TPOT-light
phoneme 0.0902 0.0920 0.0893 0.1016 0.0912 0.0905 TPOT
ozone-level-8hr 0.0588 0.0601 0.0577 0.0603 0.0598 0.0578 TPOT
steel-plates-fault 0.2041 0.2330 0.1985 0.2601 0.2146 0.2141 TPOT
wilt 0.0132 0.0159 0.0141 0.0164 0.0161 0.0123 ReinBo
Figure 3: Boxplots showing the distribution of aggregated mmce achieved by each algorithm within 10 statistical replications.
Table 5: Win-Lose-Tie comparison between ReinBo and other algorithms on benchmarking datasets based on Mann-Whitney U test (significance level α=0.05).
Random_search TPE Auto-sklearn TPOT-light TPOT
ReinBo win 5 6 3 7 5
tie 4 3 6 3 4
lose 1 1 1 0 1

Meanwhile, ReinBo has comparatively short box ranges in most cases as shown in Figure 3. Hence, we would conclude that ReinBo performed better and more stably than other algorithms in our empirical experiments. Besides comparing the final performance, it is also interesting to look into the suggested machine learning pipelines by an AutoML system. The frequencies of the operators in the pipelines suggested by ReinBo are listed in Table 6.

Table 6: Frequency of operators suggested by ReinBo. During empirical experiments there are 500 pipelines in total suggested by ReinBo at the end of optimization process. The frequency (Freq.) and relative frequency (Relative freq.) of each operator selected in best pipelines are shown here.
Preprocess Freq. Relative freq. Feature engineering Freq. Relative freq. Classifier Freq. Relative freq.
Scale(default) 259 51.8% FilterAnova 210 42.0% ksvm 276 55.2%
Scale(scale=FALSE) 106 21.2% FilterKruskal 139 27.8% ranger 201 40.2%
Scale(center=FALSE) 67 13.4% PCA 63 12.6% kknn 12 2.4%
NA 36 7.2% Univariate 46 9.2% xgboost 10 2.0%
SpatialSign 32 6.4% NA 42 8.4% naiveBayes 1 0.2%

Running time  Figure 4 shows the average running time each algorithm takes to complete one experiment, which corresponds to a complete Nested Cross Validation (NCV) process.

Figure 4: Barplot of the running time for each algorithm in average. The value reported here corresponds to the average running time of each algorithm per data set based on the NCV strategy.

From the figure, it can be seen that Auto-sklearn is the most time-consuming algorithm in our empirical experiments. Although TPOT-light is the fastest algorithm, it resulted in worse performance because it contains only fast operators. Our proposed ReinBo algorithm spent less time than Random Search and state of art AutoML systems TPOT and Auto-sklearn in average.

5 Summary and Future Work

We present a new AutoML algorithm ReinBo by embedding Bayesian Optimization into Reinforcement Learning. The Reinforcement Learning takes care of pipeline composition, and Bayesian Optimization takes care of configuring the hyper-parameters associated with the composed pipeline. ReinBo is inspired by Hyperband and previous efforts in AutoML by considering the trade-off of assigning resources to a particular configuration and exploring more configurations as a reinforcement learning problem, where the learned policy solves the trade-off automatically. Experiments show our method has a considerable improvement compared to other state of art systems and methods. For future work, it would be interesting to include meta learning into our system, which does not only learn how to construct a pipeline and configure it for a dataset in question, but also how to generalize the learned policy to a wide range of dataset by learning jointly on the meta features. Additionally, it would be nice to see how ReinBo performs on jointly optimizing neural architecture and continuous hyper-parameters like learning rate and momentum, as well as applications like Computer Vision [18] and semantic web based Recommendation Systems [20] where pipeline might play a role. Multi-Objective Bayesian Optimization [16] for the hyper-parameter tuning would be another future direction.

6 Acknowledgement

Janek Thomas has given us a lot of helpful suggestions, Martin Binder has helped us a lot with the mlrCPO setting, Florian Pfisterer has helped us with the Auto-sklearn setup. We thank all the above and Tieu-binh Ly for giving us a lot of suggestions for the language.


  • [1] Allaire, J., Ushey, K., Tang, Y.: reticulate: Interface to ’Python’ (2019),, r package version 1.11.1
  • [2] Baker, B., Gupta, O., Naik, N., Raskar, R.: Designing neural network architectures using reinforcement learning. International Conference on Learning Representations (2017)
  • [3] Banzhaf, W., Nordin, P., Keller, R.E., Francone, F.D.: Genetic programming: an introduction, vol. 1. Morgan Kaufmann San Francisco (1998)
  • [4] Barto, A.G., Mahadevan, S.: Recent advances in hierarchical reinforcement learning. Discrete event dynamic systems 13(1-2), 41–77 (2003)
  • [5] Bergstra, J., Bengio, Y.: Random search for hyper-parameter optimization. Journal of Machine Learning Research 13(Feb), 281–305 (2012)
  • [6] Bergstra, J., Yamins, D., Cox, D.D.: Hyperopt: A python library for optimizing the hyperparameters of machine learning algorithms. In: Proceedings of the 12th Python in science conference. pp. 13–20. Citeseer (2013)
  • [7] Bergstra, J.S., Bardenet, R., Bengio, Y., Kégl, B.: Algorithms for hyper-parameter optimization. In: Advances in neural information processing systems. pp. 2546–2554 (2011)
  • [8] Binder, M.: mlrCPO: Composable Preprocessing Operators and Pipelines for Machine Learning (2019),, r package version 0.3.4-2
  • [9] Bischl, B., Casalicchio, G., Feurer, M., Hutter, F., Lang, M., Mantovani, R.G., van Rijn, J.N., Vanschoren, J.: Openml benchmarking suites and the openml100. arXiv preprint arXiv:1708.03731 (2017)
  • [10] Bischl, B., Lang, M., Kotthoff, L., Schiffner, J., Richter, J., Studerus, E., Casalicchio, G., Jones, Z.M.: mlr: Machine learning in r. Journal of Machine Learning Research 17(170),  1–5 (2016),
  • [11] Bischl, B., Richter, J., Bossek, J., Horn, D., Thomas, J., Lang, M.: mlrmbo: A modular framework for model-based optimization of expensive black-box functions. arXiv preprint arXiv:1703.03373 (2017)
  • [12] Brochu, E., Cora, V.M., De Freitas, N.: A tutorial on bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. arXiv preprint arXiv:1012.2599 (2010)
  • [13] Dietterich, T.G.: Hierarchical reinforcement learning with the maxq value function decomposition. Journal of Artificial Intelligence Research 13, 227–303 (2000)
  • [14] Drori, I., Krishnamurthy, Y., Rampin, R., de Paula Lourenco, R., Ono, J.P., Cho, K., Silva, C., Freire, J.: Alphad3m: Machine learning pipeline synthesis. In: AutoML Workshop at ICML (2018)
  • [15] Feurer, M., Klein, A., Eggensperger, K., Springenberg, J., Blum, M., Hutter, F.: Efficient and robust automated machine learning. In: Advances in Neural Information Processing Systems. pp. 2962–2970 (2015)
  • [16] Horn, D., Dagge, M., Sun, X., Bischl, B.: First investigations on noisy model-based multi-objective optimization. In: International Conference on Evolutionary Multi-Criterion Optimization. pp. 298–313. Springer (2017)
  • [17] Hutter, F., Hoos, H.H., Leyton-Brown, K.: Sequential model-based optimization for general algorithm configuration. In: International Conference on Learning and Intelligent Optimization. pp. 507–523. Springer (2011)
  • [18] Jiulong, Z., Luming, G., Su, Y., Xudong, S., Xiaoshan, L.: Detecting chinese calligraphy style consistency by deep learning and one-class svm. In: 2017 2nd International Conference on Image, Vision and Computing (ICIVC). pp. 83–86. IEEE (2017)
  • [19] Kulkarni, T.D., Narasimhan, K., Saeedi, A., Tenenbaum, J.: Hierarchical deep reinforcement learning: Integrating temporal abstraction and intrinsic motivation. In: Advances in neural information processing systems. pp. 3675–3683 (2016)
  • [20] Kushwaha, N., Sun, X., Singh, B., Vyas, O.: A lesson learned from pmf based approach for semantic recommender system. Journal of Intelligent Information Systems 50(3), 441–453 (2018)
  • [21] Li, L., Jamieson, K., DeSalvo, G., Rostamizadeh, A., Talwalkar, A.: Hyperband: A novel bandit-based approach to hyperparameter optimization. The Journal of Machine Learning Research 18(1), 6765–6816 (2017)
  • [22] Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A.A., Veness, J., Bellemare, M.G., Graves, A., Riedmiller, M., Fidjeland, A.K., Ostrovski, G., et al.: Human-level control through deep reinforcement learning. Nature 518(7540),  529 (2015)
  • [23] Mohr, F., Wever, M., Hüllermeier, E.: Ml-plan: Automated machine learning via hierarchical planning. Machine Learning 107(8-10), 1495–1515 (2018)
  • [24] Olson, R.S., Moore, J.H.: Tpot: A tree-based pipeline optimization tool for automating machine learning. In: Workshop on Automatic Machine Learning. pp. 66–74 (2016)
  • [25] Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., Duchesnay, E.: Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 12, 2825–2830 (2011)
  • [26] de Sá, A.G., Pinto, W.J.G., Oliveira, L.O.V., Pappa, G.L.: Recipe: a grammar-based framework for automatically evolving classification pipelines. In: European Conference on Genetic Programming. pp. 246–261. Springer (2017)
  • [27] Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A., et al.: Mastering the game of go without human knowledge. Nature 550(7676),  354 (2017)
  • [28] Sun, X., Bommert, A., Pfisterer, F., Rahnenführer, J., Lang, M., Bischl, B.: High dimensional restrictive federated model selection with multi-objective bayesian optimization over shifted distributions (2019)
  • [29] Thornton, C., Hutter, F., Hoos, H.H., Leyton-Brown, K.: Auto-weka: Automated selection and hyper-parameter optimization of classification algorithms. CoRR abs/1208.3719 (2012),
  • [30] Thornton, C., Leyton-Brown, K.: Auto-weka: Automated selection and hyper-parameter optimization of classification algorithms (2012)
  • [31] Watkins, C.J., Dayan, P.: Q-learning. Machine learning 8(3-4), 279–292 (1992)
  • [32] Yang, F., Gustafson, S., Elkholy, A., Lyu, D., Liu, B.: Program search for machine learning pipelines leveraging symbolic planning and reinforcement learning. In: Genetic Programming Theory and Practice XVI, pp. 209–231. Springer (2019)
  • [33] Yang, F., Lyu, D., Liu, B., Gustafson, S.: Peorl: Integrating symbolic planning and hierarchical reinforcement learning for robust decision-making. arXiv preprint arXiv:1804.07779 (2018)
  • [34] Zoph, B., Le, Q.V.: Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578 (2016)