Abstract
Machine learning pipeline potentially consists of several stages ofoperations like data preprocessing, feature engineering and machine learningmodel training. Each operation has a set of hyperparameters, which can becomeirrelevant for the pipeline when the operation is not selected. This gives riseto a hierarchical conditional hyperparameter space. To optimize this mixedcontinuous and discrete conditional hierarchical hyperparameter space, wepropose an efficient pipeline search and configuration algorithm which combinesthe power of Reinforcement Learning and Bayesian Optimization. Empiricalresults show that our method performs favorably compared to state of the artmethods like Autosklearn , TPOT, Tree Parzen Window, and Random Search.
Quick Read (beta)
ReinBo: Machine Learning pipeline search and configuration with Bayesian Optimization embedded Reinforcement Learning
Abstract
Machine learning pipeline potentially consists of several stages of operations like data preprocessing, feature engineering and machine learning model training. Each operation has a set of hyperparameters, which can become irrelevant for the pipeline when the operation is not selected. This gives rise to a hierarchical conditional hyperparameter space. To optimize this mixed continuous and discrete conditional hierarchical hyperparameter space, we propose an efficient pipeline search and configuration algorithm which combines the power of Reinforcement Learning and Bayesian Optimization. Empirical results show that our method performs favorably compared to state of the art methods like Autosklearn , TPOT, Tree Parzen Window, and Random Search.
Keywords:
Bayesian Optimization Reinforcement Learning AutoMLshapes,backgrounds \usetikzlibraryarrows,positioning \usetikzlibrarytrees \usetikzlibrarychains \usetikzlibraryarrows,petri,topaths\usetikzlibraryshadows
1 Introduction
Over the past years, Machine Learning (ML) has achieved remarkable success in a wide range of application areas, which has greatly increased the demand for machine learning systems. However, an efficient machine learning algorithm crucially depends on a human expert, who has to carefully design the pipeline of the machine learning system, potentially consisting of data preprocess, feature filtering, machine learning algorithm selection, as well as identifying a good set of hyperparameters. As there are a large number of possible alternatives of models as well as hyperparameters, the need for automated machine learning (AutoML) has been growing, which is supposed to automatically generate a data analysis pipeline with machine learning methods and parameter settings that are optimized for a given data set, in order to make machine learning methods available for nonexpert users.
Since hyperparameters of a machine learning model have a large influence on the performance of the model, hyperparameter optimization becomes a critical part of an AutoML system. Popular hyperparameter optimization methods include Sequential Bayesian Optimization, which iterates between fitting surrogate models that predict model performance, and using them to make choices about which configurations to investigate.
However, the composition of the machine learning pipelines also plays a vital role in the performance of AutoML systems. Choosing different data preprocessing or feature engineering techniques as well as choosing different machine learning models for a specific dataset could potentially result in considerable performance differences. The joint optimization of the pipeline search and its associated hyperparameters configuration could essentially reside under the umbrella of Combined Algorithm Selection and Hyperparameter optimization (CASH) problem [30], where Algorithm corresponds to the pipeline and Configuration corresponds to the hyperparameters associated with the pipeline. The pipelines and hyperparameters reside in a conditional hierarchical space, where some hyperparameters only become valid when the corresponding pipeline is present. For example, Figure 1 illustrates such a situation when the data preprocessing and feature engineering operations are selected, which correspond to an incomplete pipeline, one out of three machine learning algorithms need to be chosen (indicated by dashed edges) to complete the pipeline, the corresponding hyperparameters (indicated by solid edges) of an algorithm only become valid when the algorithm is selected.
To optimize the conditional hyperparameters space jointly with the pipeline it is attached to, we embed Bayesian Optimization in the Reinforcement Learning process, and dub the method ReinBo, which means Machine Learning Pipeline search and configuration with Reinforcement Learning and Bayesian Optimization. Note that ReinBo can solve not only CASH problems, but also any mixed discrete and continuous conditional hierarchical space optimization, which is left for future work.
Our major contributions are:

•
Inspired by Hierarchical Reinforcement Learning [13], we transform the conditional hierarchical hyperparameter optimization problem into subtasks of pipeline selection and hyperparameter optimization, which circumvents the conditional constraint and reduces the search dimension.

•
To our best knowledge, we are the first to embed Bayesian Optimization (BO) into Reinforcement learning, specifically Q Learning [31] for collaborative joint search of pipelines and hyperparameters, which is different from using BO for policy optimization [12], and also different from using BO for hyperparameter fine tuning after an optimal pipeline is selected by a reinforcement learning based AutoML framework [32].

•
We provide an open source light weight R language implementation reinbo^{1}^{1} 1 https://github.com/smilesun/reinbo for the R Machine Learning community which could run efficiently on a personal computer, and takes much less resources compared to other AutoML softwares.
2 Related Work
In this section, we try to classify the current popular AutoML solutions into a taxonomy and discuss the differences of each individual work with ours.
Sequential Model Based Optimization family Autosklearn [15] and AutoWeka [30] both use Sequential Modelbased Algorithm Configuration (SMAC) [17] to solve the Combined Algorithm Selection and Hyperparameter optimization (CASH) problem. SMAC[17] transforms the conditional hierarchical hyperparameter space into a flat structure by instantiating inactive conditional parameters to default values, which allows the random forest to focus on active hyperparameters [17]. A potential drawback for this method is that the surrogate model needs to learn in a high dimensional hyperparameter space, which might need a large sample of observations to be sufficiently trained, while in practice, running machine learning algorithm is usually very expensive. Tree Parzen Window (TPE) [7], however, tackles the conditional hierarchical hyperparameter space using a tree like Parzen Window to construct two density estimators on top of a tree like hyperparameter set. Expected improvement induced from lower and upper quantile density estimators is used to select new candidate proposals from points generated by Ancestral Sampling.
Treebased Genetic Programming TPOT [24] automatically designs and optimizes machine learning pipelines with a genetic programming [3] algorithm. The machine learning operators are used as genetic programming primitives, which will be combined by treebased pipelines and the Genetic Programming algorithm is used to evolve treebased pipelines until the “best” pipeline is found. Similar methods also include Recipe [26]. However, this family of methods does not scale well [23]. In this paper, we aim for an AutoML system that could give a valuable configured pipeline within limited time.
Monte Carlo Tree Search Alike MLPlan [23] is an AutoML system, built upon a Hierarchical Task Network, which uses a Monte Carlo Tree Search like algorithm to search for pipelines and also configure the pipeline with hyperparameters. Task is expanded based on bestfirst search, where the score is estimated by a randomized depth first search by randomly trying different subtree possibilities on a Hierarchical Task Network. To ensure exploration, MLPlan gives equal possibility to the starting node in a Hierarchical Task Network and then uses a bestfirst strategy for searching at the lower part of the network. Potential drawback for this method is that the hyperparameter space is discretized, which might essentially lose good candidates in continuous spaces since large continuous hyperparameter spaces would be essentially hard to discretize.
Reinforcement Learning based Neural Network Architecture Search This family of methods are usually not termed as AutoML systems but rather Neural Architecture Search. For instance, MetaQNN [2] uses Qlearning to search convolutional neural network architectures. The learning agent is trained to sequentially choose CNN layers using Qlearning with an $\u03f5$greedy exploration strategy and the goal is to maximize the crossvalidation accuracy. In [34], instead of using Qlearning, the authors use Recurrent Neural Networks as the reinforcement learning policy approximator to generate variable strings to represent various neural architecture forms. The rewardfunction is designed to be the validation performance of the constructed network. The reinforcement learning policy is trained with gradient descent algorithm, specifically REINFORCE. The architecture elements being searched are very similar to MetaQNN. Inspired from [34], we also assume the machine learning pipeline to be optimized could be represented by a variable length string, but our work is different from [34] in that we use both Deep Qlearning and Tabular Qlearning. More importantly, compared with both [2] and [34], which optimize the neural architecture, the elements of the architecture are mostly factor variables like layer type or discretized elements like filter size, while in this paper, we deal heavily with continuous hyperparameters (The $C$ and $\sigma $ for a rbf kernel Support Vector Machine). To jointly optimize the discrete pipeline choice and associated continuous hyperparameters, we embed Bayesian Optimization inside our reinforcement learning agent.
Other Reinforcement Learning based methods In [32], the authors also combine pipeline search and hyperparameter optimization in a reinforcement learning process based on the PEORL [33] framework, however, the hyperparameter is randomly sampled during the reinforcement learning process, an extra stage is needed to sweep the hyperparameters using hyperparameter optimization techniques, while in our work, hyperparameter optimization is embedded in the reinforcement learning process. Alpha3M [14] combined MCTS and recurrent neural network in a self play [27] fashion, however, it seems that Alpha3M does not perform better than the state of art AutoML systems. For example, out of all the 6 OpenML datasets they have used to compare with state of art AutoML systems, Alpha3M only shows a clear improvement on 1 dataset (spectf) against Autosklearn [15] and TPOT [24], according to Figure 4 in [14]. Furthermore, it is not clear to us how the hyperparameters are set and if Bayesian Optimization is used. The implementation of Alpha3M takes advantage of the GPUs [14] for the fast performance while our method has a light weight implementation which efficiently runs with CPU and does not necessarily need Neural Networks.
3 Method
3.1 Towards Reinbo
As shown in Figure 2, we assume that a machine learning pipeline potentially consists of 3 stages (s1 through s3 in the figure), which include data preprocessing (imputations, NA and more), feature engineering (Principal Component Analysis for feature transform, Anova for feature filtering and more), and machine learning model (learner like SVM, Random Forest). Specifically, we use operation “NA” to indicate that no operation would be done in the stage in question. Figure 2 just serves as a toy but working example for ReinBo, in practice, there are a lot more operations available. A particular operation has associated hyperparameters (for instance the percentage of selected features in Anova feature filtering). In Figure 2, dark color filled cells (NA, Anova, SVM) represent selected operations and their associated active hyperparameters (percentage, sigma, C), while hyperparameters for inactive operations are not drawn in the figure.
Observing from Figure 2, along with Figure 1, we could think of the pipeline selection and configuration problem as a twophase process. During the first phase, a planning algorithm guides the agent to choose a path which corresponds to an unconfigured pipeline. This is similar to a multiarmed bandit problem, where each path corresponds to one arm, while difference lies in that the agent can not directly pull a discrete arm but have to pull across several consecutive discrete arm groups (each arm group corresponds to a stage in Figure 2) and the agent only gets reward after choosing one of arms from the last group. The second phase is similar to contextual bandit with continuous action space (corresponding to hyperparameters), where the context is which path from the first phase has been selected.
We model the first phase as a reinforcement learning episode, where a particular operation in stage $i$ is treated as action ${a}_{i}$, taken upon corresponding state ${s}_{i}$. State ${s}_{i}$ could be represented by actions taken up to the current stage for example. The pipeline search problem is then to find an optimal policy $\pi $ to decide which operation (action) to take at a particular state. The action value function $Q(s,a)$ at each state tells us how favorable a particular operation is. We use ${\mathcal{A}}_{{s}_{i}}$ to denote the space of legal actions at state ${s}_{i}$. Suppose a rollout of states trajectory for one composition (episode) is ${s}_{1},\mathrm{\dots},{s}_{K}$, the corresponding space of pipeline could be denoted by ${\prod}_{i=1}^{K}{\mathcal{A}}_{{s}_{i}}$, where $K$ is the total number of stages and we use $\prod $ to denote the Cartesian Product. For a more general notation, we use $\mathcal{A}({S}_{i})$ to denote the space of actions when the state is ${S}_{i}$ at stage $i$.
We search for potentially better hyperparameters in the second phase with Bayesian Optimization. Aside from the pipeline itself, each concrete operation (action ${a}_{i}$) at stage $i$ is configurable by a set of hyperparameters ${\mathrm{\Phi}}_{{a}_{i}}$. ${\mathrm{\Phi}}_{{a}_{i}}$ can be hyperparameters set for a preprocessor like the ratio of variance to keep in PCA or hyperparameters set for a machine learning model like the $C$ and $\sigma $ hyperparameter for SVM. Thus a configured pipeline search space would be ${\prod}_{i=1}^{K}\mathcal{A}({S}_{i};{\mathrm{\Phi}}_{{a}_{i}})$ where we use ${\mathrm{\Phi}}_{{a}_{i}}$ to denote the conditional hyperparameter space at stage $i$.
The connection point between reinforcement learning and Bayesian Optimization lies in the reward function design in the reinforcement learning part. During the composition process, there is no signal available to judge how good a current uncompleted pipeline is until the final learner (classifier) is configured with hyperparameters and trained on the data. At the starting point, different pipelines are tried out randomly, which corresponds to an untrained exploration policy $\pi $. A completed pipeline with a joint nonconditional hyperparameter search space is optimized with Bayesian Optimization for a few steps. The best negative loss is then used as a reward at the end of an episode to guide the reinforcement learning agent towards a better policy. The environment uncertainty only comes with the stochastic reward, while the transition from current state to next state through action is deterministic. We choose to use Qlearning [31] to optimize the policy where we have tried the Tabular Qlearning and Deep Qlearning [22]. We find out that the Tabular Qlearning works better than Deep Qlearning. For space constraint, the latter is not discussed in detail in this work.
We need Bayesian Optimization to optimize the hyperparameters in a fine grained level with limited budget, but also want to give budget preference to those promising pipelines. To circumvent the complexity of conditional and hiercharical relationship between hyperparameters and pipeline, we use reinforcement learning to choose a pipeline and let Bayesian Optimization tune the hyperparameters. We model the variation of the same pipeline with different hyperparameters as the environment uncertainty. By using separate surrogate model for each pipeline, we circumvent the risk of mistakenly modeling improper dependent structure between different hyperparameters, at a minor cost of maintaining those searched pipelines surrogate model as dictionary in memory.
3.2 Connections to Hyperband
The idea of only using a few steps Bayesian Optimization is inspired by Hyperband [21], where the tradeoff between aggressively exploring more configurations and giving each configuration more resources to be validated is solved by grid searching. Instead, in this paper, we do not need the grid search, promising pipelines will get a higher probability to be selected by our reinforcement learning agent which means these pipelines get more chances to be evaluated by the Bayesian Optimization process. The tradeoff between exploitation and exploration is naturally resolved by an $\u03f5$greedy policy, and by annealing $\u03f5$ from a large value to a small value, we encourage more exploration at the beginning. Compared to Hyperband, our method selects the budgets allocated for a particular pipeline automatically, the effectiveness of our strategy could then rely on the recent success of reinforcement learning in different areas.
3.3 Connection and Extension to Hierarchical Reinforcement Learning
Hierarchical Reinforcement Learning (hrl) [4] is proposed to tackle the curse of dimensionality in Reinforcement Learning [19]. Although the Option approach [4] is more popular, our method has a close connection to the MAXQ subtask approach [13], which divide a task recursively into subtasks and decompose the value function accordingly. The current version of ReinBo can be treated as a special case of the MAXQ task decomposition, where we have two tasks of pipeline selection and hyperparameter configuration. However, in the current version, most states are not shared between these two tasks, so there is no need to use MAXQ hrl algorithm to solve the problem. But our method can be naturally extended to a hrl version when our design space of pipeline allow shared state between the two subtasks. We leave it as future work to optimize such complicated pipelines using Hierarchical Reinforcement Learning.
3.4 Procedures of ReinBo
As shown in Algorithm 3.4 , we first initialize a policy $\pi $ for the agent which can be represented by a randomly initialized neural network or a Qtable initialized with default values, coupled with an exploration mechanism like the $\u03f5$greedy strategy. During the rollout, initial populations of pipelines get sampled, with the corresponding hyperparameter space $\mathrm{\Lambda}({\prod}_{i}{\mathrm{\Phi}}_{{a}_{i}})$ to be optimized by Bayesian Optimization for several steps. The corresponding surrogate model is stored in the dictionary $\mathcal{R}$ for future episode if the same pipeline gets rolled out again. The performance of the pipeline on validation data will be used to serve as feedback signal or reward to the reinforcement learning agent to conduct policy iteration.
{algorithm}
\[email protected]@algorithmic
\REQUIREdataset $\mathcal{D}$, Operators and Hyperparameters
Initialize Policy $\pi $
Initialize MBO Surrogate Dictionary $\mathcal{R}\leftarrow \mathrm{\varnothing}$
\WHILEBudget not reached
\STATERollout an unconfigured pipeline $\prod {a}_{i}$ according to policy $\pi $
\STATEGet hyperparameters set for the ground pipeline $\mathrm{\Lambda}(\prod {a}_{i})={\prod}_{i}{\mathrm{\Phi}}_{{a}_{i}}$
\STATEReward $R\leftarrow \text{MBO\_PROBE}(\prod {a}_{i},\mathrm{\Lambda},\mathcal{R})$
\STATEUpdate Policy $\pi $ with reinforcement learning algorithm
\ENDWHILE
Once an unconfigured pipeline is constructed at the end of the episode, running Bayesian Optimization could be beneficial in searching for a more favorable hyperparameter setting. However, Bayesian Hyperparameter Optimization with large budgets could be rather expensive. Instead, we optimize hyperparameters for an unconfigured pipeline only for several iterations. For example, we take the number of iterations to be 2 or 3 times the dimension of hyperparameter space, which means that hyperparameter spaces with higher dimension will get more sampling budgets. After each episode, the current best configuration’s performance for this pipeline in question is used as reward. The next time the same pipeline is sampled, the surrogate model could be retrieved from the dictionary $\mathcal{R}$ to facilitate further search using MBO. We dub the hyperparameter search process as MBO_PROBE, with details shown in Algorithm 3.4.^{2}^{2} 2 To save budgets, when an unconfigured pipeline does not improve after a number of trials of MBO_PROBE, it can also be suspended for future evaluation. If an unconfigured pipeline is not sampled yet, an initial design is generated to facilitate an initial surrogate model.
\[email protected]@algorithmic
\REQUIRE
\IF$\mathcal{R}\{{\prod}_{i}{a}_{i}\}=\mathrm{\varnothing}$
\STATEgenerate initial design of size ${n}^{init}$ for surrogate model;
\FOR$j$ in $1:{n}^{init}$
\STATEevaluate each design by initiating the pipeline with corresponding hyperparameters $\mathrm{\Lambda}(\prod {a}_{i})$.
\ENDFOR\STATEinitialize $\mathcal{R}\{{\prod}_{i}{a}_{i}\}$
\ENDIF\FOR$j$ in $1:{n}^{probe}$
\STATEpropose new point according to surrogate model $\mathcal{R}\{\prod {a}_{i}\}$
\STATEevaluate new point
\ENDFOR
$y$ $\leftarrow $ best accuracy until now
4 Experiments
4.1 Implementation, Comparision Methods and Setups
Our initial implementation for ReinBo is based on R machine learning packages mlr [10], mlrCPO [8] for pipeline construction and mlrMBO [11] for Bayesian Optimization. However, ReinBo could be extended to combine with any machine learning library as backends. The R package parabox^{3}^{3} 3 https://github.com/smilesun/parabox is implemented for this project to specify conditional hierarchical hyperparameter space and provides the conditional ancestral sampling (random search in conditional hyperparameter space). The R package rlR^{4}^{4} 4 https://github.com/smilesun/rlR is implemented for reinforcement learning where the user could implement a custom environment as input. All python packages are invoked with the RPython interface reticulate [1].
To evaluate the performance of our proposed method, we compare the performance of ReinBo with several state of art AutoML systems, as well as several conditional hyperparameter space tuning methods. We compare against Autosklearn [15] and TPOT [24], both based on scikitlearn [25]. At the time of writing, there is no documentation and examples online for MLPlan [23], thus, MLPlan is not included in the experiment. Additionally, we compare against hyperparameter optimization techniques which preserve the hierarchical conditional structure, including Treestructured Parzen Estimator (TPE) [7] used in Hyperopt [6], and Random Search with conditional Ancestral Sampling (self implemented in R package parabox). Random Search remains a very strong baseline in a lot of machine learning hyperparameter optimization scenarios [5].
Evaluation Criteria As warned in [23], many state of art AutoML systems seem to have missed to deal with the risk of overfitting. Therefore, in the experiment part, we focus on evaluating the generalization capability of the selected pipeline empirically. To avoid any potential confusion from synonyms, we use ${D}^{opt}$ to represent the part of a dataset fed into a given AutoML system and use ${D}^{test}$ to represent the locked out part [28] of the same dataset used to test the generalization capacity. The split of ${D}^{opt}$ and ${D}^{test}$ is done by Cross Validation, which means for a dataset $D$, $D={D}^{opt}\bigcup {D}^{test}$ and ${D}^{opt}\bigcap {D}^{test}=\mathrm{\varnothing}$. To create the ${D}^{opt}$ and ${D}^{test}$ split, we use 5fold crossvalidation (CV5), which corresponds to the outer loop of Nested Cross Validation (NCV). We take the aggregated mmce (mean misscalssification error) across the 5 iterations of outer loop of NCV as performance measure.
Instead of using running time as budget, we use the number of pipeline configurations as the unit of budget, to account for performances variations due to hardware and network load etc. For ${D}^{opt}$, we assigned a budget of 1000 times of CV5 equivalent to each AutoML algorithm, which corresponds to the inner loop of NCV.
Other Setups To account for different AutoML systems data input format incompatibility problem, we conduct dummy encoding to categorical features beforehand. Aiming for a light weight implementation, in the experiment, we limit our choice of pipeline components for ReinBo. We compose a pipeline in 3 stages, with potential operations/actions at each stage listed in Table 1. Associated hyperparameters with an unconfigured pipeline are listed in Table 2. We call the components and associated hyperparameters the pipeline pool. The same pipeline pool is used for ReinBo, TPE and Random Search.
For Autosklearn, Metalearning and ensemble are disabled, the resampling strategy is set to be CV5, stop criteria is changed to budget instead of time and all other configurations are kept default. We have contacted the author of Autosklearn through github for the right use of the API to ensure the above configuration is satisfied. For TPOT (version 0.9), the default configuration space contains a lot of operators while the light version provides only fast models and preprocessors. The light TPOT is therefore less timeconsuming but it could probably lead to lower accuracy in consequence. For this reason, we compare ReinBo with both TPOT with the default configuration and TPOT with light configuration, and we call them TPOT and TPOTlight respectively. TPOT is configured to allow equal amount of budgets with all methods being compared and other configurations are left to be default.
Datasets We experimented on a set of standard benchmarking datasets of high quality collected from OpenMLCC18^{5}^{5} 5 https://www.openml.org/s/99 [9] and AutoWeka [29], which are rather wellcurated from many thousands and have diverse numbers of classes, features, observations, as well as various ratios of the minority and majority class size. Summary of these datasets is listed in Table 3.
Stage  Operation/Action  

1  DataPreprocess  Scale(default)  Scale(center=FALSE)  Scale(scale=FALSE)  SpatialSign  NA 
2  Feature Engineering  Pca  FilterKruskal  FilterAnova  FilterUnivariate  NA 
3  Classifier  kknn  ksvm  ranger  xgboost  naiveBayes 
Operation  Parameter  Type  Range 

Anonva,Kruskal,Univariate  perc  numeric  (0.1,1) 
Pca  rank  integer  (p/10,p) 
kknn  k  integer  (1,20) 
ksvm  C  numeric  (${2}^{15}$,${2}^{15}$) 
ksvm  sigma  numeric  (${2}^{15}$,${2}^{15}$) 
ranger  mtry  integer  (p/10,p/1.5) 
ranger  sample.fraction  numeric  (0.1,1) 
xgboost  eta  numeric  (0.001,0.3) 
xgboost  max_depth  integer  (1,15) 
xgboost  subsample  numeric  (0.5,1) 
xgboost  colsample_bytree  numeric  (0.5,1) 
xgboost  min_child_weight  numeric  (0,50) 
naiveBayes  laplace  numeric  (0.01,100) 
task_id  name  nClass  nFeat  nObs  rMinMaj 

14  mfeatfourier  10  77  2000  1.00 
23  cmc  3  10  1473  0.53 
37  diabetes  2  9  768  0.54 
53  vehicle  4  19  846  0.91 
3917  kc1  2  22  2109  0.18 
9946  wdbc  2  31  569  0.59 
9952  phoneme  2  6  5404  0.42 
9978  ozonelevel8hr  2  73  2534  0.07 
125921  LEDdisplaydomain7digit  10  8  500  0.65 
146817  steelplatesfault  7  28  1941  0.08 
146820  wilt  2  6  4839  0.06 
4.2 Experiment results
In Figure 3, we compare the mmce (1Accuracy) of each method with boxplot over the datasets listed in Table 3 across 10 statistical replications. Additionally, we list numerical results in Table 4 with statistical test, where each numerical value represents the aggregated mean mmce over the statistical replications. Underline in each row indicates the smallest mean value over the corresponding dataset. The boldfaced values indicate that the corresponding algorithm does not perform significantly worse than the underlined algorithm on the corresponding dataset based on MannWhitney U test. As shown in Table 4, MLReinBo has boldfaces for 8 of 10 datasets followed by much less boldfaces from other methods.
In Table 5, we compare win (significantly better), lose and tie (neither significantly better nor worse) relationships according to the test. As shown in Table 5, ReinBo has won TPOT on 5 datasets and performed worse than TPOT for only one dataset. And not surprisingly, TPOT has performed considerably better than TPOTlight in the empirical experiments since TPOTlight has smaller search space with only fast models and preprocessors. ReinBo also performs much better than Random Search and TPE, where ReinBo has significantly won them on 5 and 6 tasks respectively and lost only on 1 task. Compared to ReinBo, Autosklearn has won only once and behaved worse than ReinBo on 3 of 10 datasets.
dataset  Autosklearn  TPE  TPOT  TPOTlight  Random  ReinBo  Underlined Algorithm 

mfeatfourier  0.1412  0.1542  0.1451  0.1489  0.1580  0.1278  ReinBo 
cmc  0.4470  0.4485  0.4457  0.4506  0.4500  0.4485  TPOT 
diabetes  0.2483  0.2436  0.2452  0.2413  0.2455  0.2395  ReinBo 
vehicle  0.1679  0.2117  0.1784  0.2057  0.2020  0.1621  ReinBo 
kc1  0.1421  0.1351  0.1380  0.1438  0.1353  0.1387  TPE 
wdbc  0.0299  0.0348  0.0353  0.0264  0.0341  0.0271  TPOTlight 
phoneme  0.0902  0.0920  0.0893  0.1016  0.0912  0.0905  TPOT 
ozonelevel8hr  0.0588  0.0601  0.0577  0.0603  0.0598  0.0578  TPOT 
steelplatesfault  0.2041  0.2330  0.1985  0.2601  0.2146  0.2141  TPOT 
wilt  0.0132  0.0159  0.0141  0.0164  0.0161  0.0123  ReinBo 
Random_search  TPE  Autosklearn  TPOTlight  TPOT  

ReinBo  win  5  6  3  7  5 
tie  4  3  6  3  4  
lose  1  1  1  0  1 
Meanwhile, ReinBo has comparatively short box ranges in most cases as shown in Figure 3. Hence, we would conclude that ReinBo performed better and more stably than other algorithms in our empirical experiments. Besides comparing the final performance, it is also interesting to look into the suggested machine learning pipelines by an AutoML system. The frequencies of the operators in the pipelines suggested by ReinBo are listed in Table 6.
Preprocess  Freq.  Relative freq.  Feature engineering  Freq.  Relative freq.  Classifier  Freq.  Relative freq. 

Scale(default)  259  51.8%  FilterAnova  210  42.0%  ksvm  276  55.2% 
Scale(scale=FALSE)  106  21.2%  FilterKruskal  139  27.8%  ranger  201  40.2% 
Scale(center=FALSE)  67  13.4%  PCA  63  12.6%  kknn  12  2.4% 
NA  36  7.2%  Univariate  46  9.2%  xgboost  10  2.0% 
SpatialSign  32  6.4%  NA  42  8.4%  naiveBayes  1  0.2% 
Running time Figure 4 shows the average running time each algorithm takes to complete one experiment, which corresponds to a complete Nested Cross Validation (NCV) process.
From the figure, it can be seen that Autosklearn is the most timeconsuming algorithm in our empirical experiments. Although TPOTlight is the fastest algorithm, it resulted in worse performance because it contains only fast operators. Our proposed ReinBo algorithm spent less time than Random Search and state of art AutoML systems TPOT and Autosklearn in average.
5 Summary and Future Work
We present a new AutoML algorithm ReinBo by embedding Bayesian Optimization into Reinforcement Learning. The Reinforcement Learning takes care of pipeline composition, and Bayesian Optimization takes care of configuring the hyperparameters associated with the composed pipeline. ReinBo is inspired by Hyperband and previous efforts in AutoML by considering the tradeoff of assigning resources to a particular configuration and exploring more configurations as a reinforcement learning problem, where the learned policy solves the tradeoff automatically. Experiments show our method has a considerable improvement compared to other state of art systems and methods. For future work, it would be interesting to include meta learning into our system, which does not only learn how to construct a pipeline and configure it for a dataset in question, but also how to generalize the learned policy to a wide range of dataset by learning jointly on the meta features. Additionally, it would be nice to see how ReinBo performs on jointly optimizing neural architecture and continuous hyperparameters like learning rate and momentum, as well as applications like Computer Vision [18] and semantic web based Recommendation Systems [20] where pipeline might play a role. MultiObjective Bayesian Optimization [16] for the hyperparameter tuning would be another future direction.
6 Acknowledgement
Janek Thomas has given us a lot of helpful suggestions, Martin Binder has helped us a lot with the mlrCPO setting, Florian Pfisterer has helped us with the Autosklearn setup. We thank all the above and Tieubinh Ly for giving us a lot of suggestions for the language.
References
 [1] Allaire, J., Ushey, K., Tang, Y.: reticulate: Interface to ’Python’ (2019), https://CRAN.Rproject.org/package=reticulate, r package version 1.11.1
 [2] Baker, B., Gupta, O., Naik, N., Raskar, R.: Designing neural network architectures using reinforcement learning. International Conference on Learning Representations (2017)
 [3] Banzhaf, W., Nordin, P., Keller, R.E., Francone, F.D.: Genetic programming: an introduction, vol. 1. Morgan Kaufmann San Francisco (1998)
 [4] Barto, A.G., Mahadevan, S.: Recent advances in hierarchical reinforcement learning. Discrete event dynamic systems 13(12), 41–77 (2003)
 [5] Bergstra, J., Bengio, Y.: Random search for hyperparameter optimization. Journal of Machine Learning Research 13(Feb), 281–305 (2012)
 [6] Bergstra, J., Yamins, D., Cox, D.D.: Hyperopt: A python library for optimizing the hyperparameters of machine learning algorithms. In: Proceedings of the 12th Python in science conference. pp. 13–20. Citeseer (2013)
 [7] Bergstra, J.S., Bardenet, R., Bengio, Y., Kégl, B.: Algorithms for hyperparameter optimization. In: Advances in neural information processing systems. pp. 2546–2554 (2011)
 [8] Binder, M.: mlrCPO: Composable Preprocessing Operators and Pipelines for Machine Learning (2019), https://CRAN.Rproject.org/package=mlrCPO, r package version 0.3.42
 [9] Bischl, B., Casalicchio, G., Feurer, M., Hutter, F., Lang, M., Mantovani, R.G., van Rijn, J.N., Vanschoren, J.: Openml benchmarking suites and the openml100. arXiv preprint arXiv:1708.03731 (2017)
 [10] Bischl, B., Lang, M., Kotthoff, L., Schiffner, J., Richter, J., Studerus, E., Casalicchio, G., Jones, Z.M.: mlr: Machine learning in r. Journal of Machine Learning Research 17(170), 1–5 (2016), http://jmlr.org/papers/v17/15066.html
 [11] Bischl, B., Richter, J., Bossek, J., Horn, D., Thomas, J., Lang, M.: mlrmbo: A modular framework for modelbased optimization of expensive blackbox functions. arXiv preprint arXiv:1703.03373 (2017)
 [12] Brochu, E., Cora, V.M., De Freitas, N.: A tutorial on bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. arXiv preprint arXiv:1012.2599 (2010)
 [13] Dietterich, T.G.: Hierarchical reinforcement learning with the maxq value function decomposition. Journal of Artificial Intelligence Research 13, 227–303 (2000)
 [14] Drori, I., Krishnamurthy, Y., Rampin, R., de Paula Lourenco, R., Ono, J.P., Cho, K., Silva, C., Freire, J.: Alphad3m: Machine learning pipeline synthesis. In: AutoML Workshop at ICML (2018)
 [15] Feurer, M., Klein, A., Eggensperger, K., Springenberg, J., Blum, M., Hutter, F.: Efficient and robust automated machine learning. In: Advances in Neural Information Processing Systems. pp. 2962–2970 (2015)
 [16] Horn, D., Dagge, M., Sun, X., Bischl, B.: First investigations on noisy modelbased multiobjective optimization. In: International Conference on Evolutionary MultiCriterion Optimization. pp. 298–313. Springer (2017)
 [17] Hutter, F., Hoos, H.H., LeytonBrown, K.: Sequential modelbased optimization for general algorithm configuration. In: International Conference on Learning and Intelligent Optimization. pp. 507–523. Springer (2011)
 [18] Jiulong, Z., Luming, G., Su, Y., Xudong, S., Xiaoshan, L.: Detecting chinese calligraphy style consistency by deep learning and oneclass svm. In: 2017 2nd International Conference on Image, Vision and Computing (ICIVC). pp. 83–86. IEEE (2017)
 [19] Kulkarni, T.D., Narasimhan, K., Saeedi, A., Tenenbaum, J.: Hierarchical deep reinforcement learning: Integrating temporal abstraction and intrinsic motivation. In: Advances in neural information processing systems. pp. 3675–3683 (2016)
 [20] Kushwaha, N., Sun, X., Singh, B., Vyas, O.: A lesson learned from pmf based approach for semantic recommender system. Journal of Intelligent Information Systems 50(3), 441–453 (2018)
 [21] Li, L., Jamieson, K., DeSalvo, G., Rostamizadeh, A., Talwalkar, A.: Hyperband: A novel banditbased approach to hyperparameter optimization. The Journal of Machine Learning Research 18(1), 6765–6816 (2017)
 [22] Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A.A., Veness, J., Bellemare, M.G., Graves, A., Riedmiller, M., Fidjeland, A.K., Ostrovski, G., et al.: Humanlevel control through deep reinforcement learning. Nature 518(7540), 529 (2015)
 [23] Mohr, F., Wever, M., Hüllermeier, E.: Mlplan: Automated machine learning via hierarchical planning. Machine Learning 107(810), 1495–1515 (2018)
 [24] Olson, R.S., Moore, J.H.: Tpot: A treebased pipeline optimization tool for automating machine learning. In: Workshop on Automatic Machine Learning. pp. 66–74 (2016)
 [25] Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., Duchesnay, E.: Scikitlearn: Machine learning in Python. Journal of Machine Learning Research 12, 2825–2830 (2011)
 [26] de Sá, A.G., Pinto, W.J.G., Oliveira, L.O.V., Pappa, G.L.: Recipe: a grammarbased framework for automatically evolving classification pipelines. In: European Conference on Genetic Programming. pp. 246–261. Springer (2017)
 [27] Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A., et al.: Mastering the game of go without human knowledge. Nature 550(7676), 354 (2017)
 [28] Sun, X., Bommert, A., Pfisterer, F., Rahnenführer, J., Lang, M., Bischl, B.: High dimensional restrictive federated model selection with multiobjective bayesian optimization over shifted distributions (2019)
 [29] Thornton, C., Hutter, F., Hoos, H.H., LeytonBrown, K.: Autoweka: Automated selection and hyperparameter optimization of classification algorithms. CoRR abs/1208.3719 (2012), http://arxiv.org/abs/1208.3719
 [30] Thornton, C., LeytonBrown, K.: Autoweka: Automated selection and hyperparameter optimization of classification algorithms (2012)
 [31] Watkins, C.J., Dayan, P.: Qlearning. Machine learning 8(34), 279–292 (1992)
 [32] Yang, F., Gustafson, S., Elkholy, A., Lyu, D., Liu, B.: Program search for machine learning pipelines leveraging symbolic planning and reinforcement learning. In: Genetic Programming Theory and Practice XVI, pp. 209–231. Springer (2019)
 [33] Yang, F., Lyu, D., Liu, B., Gustafson, S.: Peorl: Integrating symbolic planning and hierarchical reinforcement learning for robust decisionmaking. arXiv preprint arXiv:1804.07779 (2018)
 [34] Zoph, B., Le, Q.V.: Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578 (2016)