Abstract
Reinforcement learning (RL) is a goaloriented learning solution that hasproven to be successful for Neural Architecture Search (NAS) on the CIFAR andImageNet datasets. However, a limitation of this approach is its highcomputational cost, making it unfeasible to replay it on other datasets.Through metalearning, we could bring this cost down by adapting previouslylearned policies instead of learning them from scratch. In this work, wepropose a deep metaRL algorithm that learns an adaptive policy over a set ofenvironments, making it possible to transfer it to previously unseen tasks. Thealgorithm was applied to various proofofconcept environments in the past, butwe adapt it to the NAS problem. We empirically investigate the agent's behaviorduring training when challenged to design chainstructured neural architecturesfor three datasets with increasing levels of hardness, to later fix the policyand evaluate it on two unseen datasets of different difficulty. Our resultsshow that, under resource constraints, the agent effectively adapts itsstrategy during training to design better architectures than the ones designedby a standard RL algorithm, and can design good architectures during theevaluation on previously unseen environments. We also provide guidelines on theapplicability of our framework in a more complex NAS setting by studying theprogress of the agent when challenged to design multibranch architectures.
Quick Read (beta)
Learning to reinforcement learn for Neural Architecture Search
Abstract
Reinforcement learning (RL) is a goaloriented learning solution that has proven to be successful for Neural Architecture Search (NAS) on the CIFAR and ImageNet datasets. However, a limitation of this approach is its high computational cost, making it unfeasible to replay it on other datasets. Through metalearning, we could bring this cost down by adapting previously learned policies instead of learning them from scratch. In this work, we propose a deep metaRL algorithm that learns an adaptive policy over a set of environments, making it possible to transfer it to previously unseen tasks. The algorithm was applied to various proofofconcept environments in the past, but we adapt it to the NAS problem. We empirically investigate the agent’s behavior during training when challenged to design chainstructured neural architectures for three datasets with increasing levels of hardness, to later fix the policy and evaluate it on two unseen datasets of different difficulty. Our results show that, under resource constraints, the agent effectively adapts its strategy during training to design better architectures than the ones designed by a standard RL algorithm, and can design good architectures during the evaluation on previously unseen environments. We also provide guidelines on the applicability of our framework in a more complex NAS setting by studying the progress of the agent when challenged to design multibranch architectures.
positioning, backgrounds, decorations.pathreplacing, arrows, arrows.meta, calc
Joaquin Vanschoren [email protected]
Department of Mathematics and Computer Science
Eindhoven University of Technology
Eindhoven 5612 AZ, The Netherlands
Keywords: Neural Architecture Search, Deep MetaReinforcement Learning, Image Classification
1 Introduction
Neural networks have achieved remarkable results in many fields, such as that of Image Classification. Crucial aspects of this success are the choice of the neural architecture and the chosen hyperparameters for the particular dataset of interest; however, this is not always straightforward. Although stateoftheart neural networks can inspire the design of other architectures, this process heavily relies on the designer’s level of expertise, making it a challenging and cumbersome task that is prone to deliver underperforming networks.
In an attempt to overcome these flaws, researchers have explored various techniques under the name of Neural Architecture Search (NAS) (Elsken et al., 2018). In NAS, the ultimate goal is to come up with an algorithm that takes any arbitrary dataset as input and outputs a wellperforming neural network for some learning task of interest, so that we can accelerate the design process and remove the dependency on human intervention. Nevertheless, coming up with a solution of this kind is a complicated endeavor where researchers have to deal with several aspects such as the type of the networks that they consider, the scope of the automation process, or the search strategy applied. A particular search strategy for NAS is reinforcement learning (RL), where a socalled agent learns how to design neural networks by sampling architectures and using their numeric performance on a specific dataset as the reward signals that guide the search. Popular standard RL algorithms such as Qlearning or Reinforce have been used to design stateoftheart Convolutional Neural Networks (CNNs) for classification tasks on the CIFAR and ImageNet datasets (Pham et al., 2018; Cai et al., 2018; Zhong et al., 2018; Zoph and Le, 2016; Baker et al., 2016), but little attention is paid to deliver architectures for other datasets. In an attempt to fill that gap, a suitable alternative is deep metaRL (Wang et al., 2016; Duan et al., 2016), where the agent acts on various environments to learn an adaptive policy that can be transferred to new environments.
In this work, we apply deep metaRL to NAS, which, to the best of our knowledge, is a novel contribution. The environments that we consider are associated with standard image classification tasks on datasets with different levels of hardness sampled from a metadataset (Triantafillou et al., 2019). Our main experiments focus on the design of chainstructured networks and show that, under resource constraints, the resulting policy can adapt to new environments, outperform standard RL, and design better architectures than the ones inspired by stateoftheart networks. We also experiment with extending our approach to the design of multibranch architectures so that we can give directions for future work.
The remainder of this report is structured as follows. First, in Section 2, we introduce the preliminary concepts required to understand our work. Next, in Section 3, we discuss the related work for both reinforcement learning and NAS. In Section 4, we formally introduce our methodology, and in Section 5, the framework developed to implement it. In Section 6, we define the experiments, and in Section 7, we show the results. Finally, in Section 8, the conclusions are set out.
2 Preliminaries
2.1 Reinforcement learning
Reinforcement learning (RL) is an approach to automate goaldirected learning (Sutton and Barto, 2012). It relies on two entities that interact with each other: an environment that delivers information of its state, and an agent that using such information learns how to achieve a goal in the environment. The interaction is a bilateral communication where the agent performs actions to modify the state of the environment, which responds with a numeric reward measuring how good the action was to achieve the goal. Typically, the sole interest of the agent is to improve its decisionmaking strategy, known as the policy, to maximize the total reward received over the whole interaction trial since this will lead it to the desired goal. More strictly, RL is formalized using finite Markov Decision Processes (MDPs) as in Definition 1 borrowed from Duan et al. (2016), resulting in the agentenvironment interaction illustrated in Figure 1.
Definition 1 (Reinforcement Learning)
We define a discretetime finitehorizon discounted MDP $M\mathrm{=}\mathrm{(}\mathrm{X}\mathrm{,}\mathrm{A}\mathrm{,}\mathrm{P}\mathrm{,}r\mathrm{,}{\rho}_{\mathrm{0}}\mathrm{,}\gamma \mathrm{,}T\mathrm{)}$, in which $\mathrm{X}$ is a state set, $\mathrm{A}$ an action set, $\mathrm{P}\mathrm{:}\mathrm{X}\mathrm{\times}\mathrm{A}\mathrm{\times}\mathrm{X}\mathrm{\mapsto}{\mathrm{R}}_{\mathrm{+}}$ a transition probability distribution, $r\mathrm{:}\mathrm{X}\mathrm{\times}\mathrm{A}\mathrm{\mapsto}\mathrm{[}\mathrm{}{R}_{m\mathit{}a\mathit{}x}\mathrm{,}{R}_{m\mathit{}a\mathit{}x}\mathrm{]}$ a bounded reward function, ${\rho}_{\mathrm{0}}\mathrm{:}\mathrm{X}\mathrm{\mapsto}{\mathrm{R}}_{\mathrm{+}}$ an initial state distribution, $\gamma \mathrm{\in}\mathrm{[}\mathrm{0}\mathrm{,}\mathrm{1}\mathrm{]}$ a discount factor, and $T$ the horizon. Reinforcement learning typically aims to optimize a stochastic policy ${\pi}_{\theta}\mathrm{:}\mathrm{X}\mathrm{\times}\mathrm{A}\mathrm{\mapsto}{\mathrm{R}}_{\mathrm{+}}$ by maximizing the expected reward, modeled as $\eta \mathit{}\mathrm{(}{\pi}_{\theta}\mathrm{)}\mathrm{=}{\mathrm{E}}_{\tau}\mathit{}\mathrm{[}{\mathrm{\sum}}_{t\mathrm{=}\mathrm{0}}^{T}{\gamma}^{t}\mathit{}r\mathit{}\mathrm{(}{x}_{t}\mathrm{,}{a}_{t}\mathrm{)}\mathrm{]}$, where $\tau \mathrm{=}\mathrm{(}{s}_{\mathrm{0}}\mathrm{,}{\alpha}_{\mathrm{0}}\mathrm{,}\mathrm{\dots}\mathrm{)}$ denotes the whole trajectory, ${x}_{t}\mathrm{\in}\mathrm{X}$, ${x}_{\mathrm{0}}\mathrm{\sim}{\rho}_{\mathrm{0}}\mathit{}\mathrm{(}{x}_{\mathrm{0}}\mathrm{)}$, ${a}_{t}\mathrm{\in}\mathrm{A}$, ${a}_{t}\mathrm{\sim}{\pi}_{\theta}\mathit{}\mathrm{(}{a}_{t}\mathrm{}{x}_{t}\mathrm{)}$, and ${x}_{t\mathrm{+}\mathrm{1}}\mathrm{\sim}\mathrm{P}\mathit{}\mathrm{(}{x}_{t\mathrm{+}\mathrm{1}}\mathrm{}{x}_{t}\mathrm{,}{a}_{t}\mathrm{)}$.
2.2 Neural Architecture Search
Neural Architecture Search (NAS) is the process of automating the design of neural networks. In order to formalize this definition, it is convenient to refer to the survey of Elsken et al. (2018), which characterize a NAS work with three variables: the search space, the search strategy, and the performance estimation strategy. Figure 2 illustrates the interaction between these variables.
The search space is the set of architectures considered in the search process. It is possible to define different spaces by constraining attributes of the networks, such as the maximum depth allowed, the type of layers to use, or the connections permitted between layers. A common abstraction inspired in popular networks is to separate the search spaces in chain structures and multibranch structures that can be either complete neural networks or cells that can be used to build more complex networks, as illustrated in Figure 3.
On the other hand, the search strategy is simply the algorithm used to perform the search. The choices range from naive approaches such as random search to more sophisticated ones like reinforcement learning (Baker et al., 2016; Zoph and Le, 2016), evolutionary algorithms (Real et al., 2018), or gradient descent search (Liu et al., 2018).
Lastly, the performance estimation strategy is the function used to measure the goodness of the sampled architectures. Formally, it is a function ${R}_{D}:\mathcal{X}\mapsto \mathbb{R}$ evaluating an architecture on a dataset $D$. The vanilla estimation strategy is the test accuracy after training of a network, but different alternatives are proposed to try delivering an accurate estimate in a short time since expensive training creates a bottleneck in the search process.
3 Related work
3.1 Reinforcement learning
The key to reinforcement learning is the algorithm used to optimize the policy ${\pi}_{\theta}$ (see Definition 1). Through the years, researchers have proposed different algorithms, such as Reinforce (Williams, 1992), Qlearning (Watkins and Dayan, 1992), ActorCritic (Konda, 2002), DeepQNetwork (DQN) (Mnih et al., 2013), Trust Region Policy Optimization (TRPO) (Schulman et al., 2015), and Proximal Policy Optimization (PPO) (Schulman et al., 2017). These algorithms have successfully solved problems in a variety of fields, from robotics (Kober et al., 2013) and video games (Bouzy and Chaslot, 2006; Mnih et al., 2013) to trafficcontrol (Arel et al., 2010) or computational resources management (Mao et al., 2016), showing the power and utility of the reinforcement learning framework. Despite its success, a theoretical flaw of RL is that the policy learned only captures the onetoone stateaction relation of the environment in question, making it necessary to perform individual runs for every new environment of interest. The latter is a costly trait of this standard form of RL since it typically requires thousands of steps to converge to an optimal policy.
A research area that addresses this problem is metaRL, in which agents are trained to learn transferable policies that do not require training from scratch on new problems. We identify two types of metaRL algorithms: the ones that learn a good initialization for the neural networks representing the policy^{1}^{1} 1 The term deep metareinforcement learning comes from the usage of deep models, such as neural networks, to represent the policy to be learned., and the ones that learn policies that can adapt their decisionmaking strategy to new environments, ideally without further training. First, Model Agnostic MetaLearning (MAML) (Finn et al., 2017) learns an initialization that allows fewshot learning in RL; however, it relies on the strong assumption of working with a distribution of similar environments, which cannot always be guaranteed, thus limiting its scope. Second, two algorithms have been proposed to learn policies that, once deployed, can adapt their decisionmaking strategy to new environments, ideally without further training: Learning to reinforcement learn (Wang et al., 2016) and RL${}^{2}$ (Duan et al., 2016). These methods aim to learn a more sophisticated policy modeled by a Recurrent Neural Network (RNN) that captures the relation between states, actions, and metadata of past actions. What emerges is a policy that can adapt its strategy to different environments. The main difference between the two works is the set of environments considered: for Wang et al. (2016) they come from a parameterized distribution, whereas for Duan et al. (2016) they are relatively unrelated. The latter is an essential advantage over MAML, making them more suitable for scenarios where a distribution of environments cannot be guaranteed. Third, Simple Neural AttentIve Learner (SNAIL) extends the idea behind Learning to reinforcement learn and RL${}^{2}$ by using a more powerful temporalfocused model than the simple RNN. We note that none of these approaches have been applied to NAS.
3.2 Neural Architecture Search
As introduced earlier in Section 2.2, it is possible to address Neural Architecture Search (NAS) in different ways. Remarkable results have been achieved by applying optimization techniques such as Bayesian optimization, evolutionary algorithms, gradientbased search, and reinforcement learning. We are interested in reinforcement learning for NAS, due to the variety of works that have achieved stateoftheart results. For other work in NAS, we refer to the survey of Elsken et al. (2018).
Although the ultimate goal of NAS is to come up with a straightforward fullyautomated solution that can deliver a neural architecture for a machine learning task on any dataset of interest, there exist several factors that impede that ambition. Perhaps the most important of these factors is the high computational cost of NAS with reinforcement learning, which imposes constraints on different elements that impact the scope of the solutions. The first bottleneck is the computation of the reward, which typically is the test accuracy of the sampled architectures after training. Because of the expensiveness of such computation, researchers have proposed various performance estimation strategies to avoid expensive training procedures, and they have also imposed some constraints over the search space considered so that a lower number of architectures get sampled and evaluated. For the first aspect, we observe several relaxations: reducing the number of training epochs (Baker et al., 2016; Zhong et al., 2018), sharing of weights between similar networks (Cai et al., 2018; Pham et al., 2018), or entirely skipping the trainevaluation procedure by predicting the performance of the architectures (Zhong et al., 2018). Although these alternatives have successfully reduced the computation time, they pay little attention to the effect of their potentially unfair estimations on the decisionmaking of the agent, and therefore, one should treat them carefully. On the other hand, for the search space, crucial choices are the cardinality of the space and the complexity of the architectures. Some researchers opt for ample spaces with various types of layers and no restrictions in their connections (Zoph and Le, 2016; Zoph et al., 2017; Zhong et al., 2018), whereas others prefer them smaller, such as a chainstructured space (Baker et al., 2016), or a multibranch space modeled as a fullyconnected graph with low cardinality (Pham et al., 2018).
It is important to note that an approach can dramatically reduce its computation time with relaxations on the search space alone. For instance, the same methodology can decrease its computational cost by a factor of 7 (28 days to 4 days, with hundreds of GPUs in both cases) if the space is restricted in the types of layers and the number of elements allowed in the architectures (Zoph and Le, 2016; Zoph et al., 2017). Furthermore, a constrained search space used jointly with some performance estimation strategy can reduce the cost to only 1 day with 1 GPU such as in BlockQNNV2 (Zhong et al., 2018) and ENAS (Pham et al., 2018); however, this drastic reduction of the computational time should be treated with caution. In the case of BlockQNNV2, the estimation of the performance of the networks (i.e., accuracy at a given epoch) depends on a surrogate prediction model that is not studied in detail by the authors, thus leaving room for potentially wrong predictions. On the other hand, a recent investigation (Singh et al., 2019) shows that the quality of the networks delivered by ENAS is not a consequence of reinforcement learning, but of the search space, which contains a majority of wellperforming architectures that can be explored with a less expensive procedure such as random search, therefore losing its character of artificially smart search.
Another factor impacting a NAS with reinforcement learning work is the input dataset used. Although they usually transfer the best CIFARbased architecture designed by the agent to the ImageNet dataset (Baker et al., 2016; Zoph and Le, 2016; Zoph et al., 2017; Cai et al., 2018; Pham et al., 2018), none of them make the agent design networks for other datasets. Furthermore, none of the works give insight on how using a different dataset could affect the complexity of the search. We believe that the lack of study for other datasets is ascribed to the costly taskoriented design of the reinforcement learning algorithms used, Qlearning and Reinforce, that requires to train the agent from scratch for every environment (i.e., a dataset) of interest. The authors do not justify the choice of these algorithms; hence, it would be desirable to study other reinforcement learning algorithms in the same NAS scenarios.
4 Methodology
We aim to improve the performance of Neural Architecture Search (NAS) with reinforcement learning (RL) by using metalearning. We, therefore, build a metaRL system that can learn across environments and adapt to them. We split the system into two components: the NAS variables and the reinforcement learning framework. For the reinforcement learning framework, we make use of a deep metaRL algorithm that follows the same line of Learning to reinforcement learn (Wang et al., 2016) and RL${}^{\mathrm{2}}$ (Duan et al., 2016), with some minor adaptations in the metadata employed and the design of the episodes. The environments that we consider are neural architecture design tasks for different datasets sampled from the metadataset collection (Triantafillou et al., 2019). On the other hand, for the NAS elements, we work with a slightly modified version of the search space of BlockQNN (Zhong et al., 2018) and similarly, we use the test accuracy after earlystop training as the reward associated with the sampled architectures. In the remainder of the section, we discuss these elements further.
4.1 The Neural Architecture Search elements
As described in Section 2.2, three NAS variables characterize a research work in this area: the search strategy, the search space, and the performance estimation strategy. In our case, we constrain the search strategy to a deep metareinforcement learning algorithm that we explain in detail in Section 4.2, and thus, here we only elaborate on the remaining two.
4.1.1 The search space
The set of architectures considered in our work is inspired by BlockQNN (Zhong et al., 2018), which defines the search space as all architectures that can be generated by sequentially stacking $d\in \mathbb{N}$ vectors from a socalled Network Structure Code (NSC) space containing encodings of the most relevant layers for CNNs. An NSC vector has information of the type of a layer, the value of its most important hyperparameter, its position on the network, and the allowed incoming connections (i.e., the inputs) so that it becomes possible to represent any architecture as a list of NSCs. The NSC definition is flexible in that it can easily be modified or extended and, moreover, it allows us to define an equivalent discrete action space for the reinforcement learning agent as described in Section 4.2.2.
In Table 1, we present the NSC space for our implementation. Given a list of NSC vectors representing an architecture, the network is built following the next rules: firstly, based on BlockQNN’s results, if a convolution layer is found then a Preactivation Convolutional Cell^{2}^{2} 2 The PCC stacks a ReLU, a convolution, and a batch normalization unit. (PCC) (He et al., 2016) with 32 units^{3}^{3} 3 The selection of the number of units is made to reduce the cost of the training of the networks. is used; secondly, the concatenation and addition operations create padded versions of their inputs if they have different shapes; thirdly, if at the end of the building process the network has two or more leaves then they get merged with a concatenation operation^{4}^{4} 4 The last two rules do not apply for chainstructured architectures since no merge operations are needed.. Figure 4 illustrates these rules.
Name  Index  Type  Kernel size^{5}^{5} 5 The kernel size is an attribute for the convolutions, whereas for the pooling elements it refers to the layer’s pool size.  Predecessor 1  Predecessor 2 

Convolution  T  1  {1, 3, 5}  K  $\mathrm{\varnothing}$ 
Max Pooling  T  2  {2, 3}  K  $\mathrm{\varnothing}$ 
Average Pooling  T  3  {2, 3}  K  $\mathrm{\varnothing}$ 
Addition  T  5  $\mathrm{\varnothing}$  K  K 
Concatenation  T  6  $\mathrm{\varnothing}$  K  K 
Terminal  T  7  $\mathrm{\varnothing}$  $\mathrm{\varnothing}$  $\mathrm{\varnothing}$ 
4.1.2 The performance estimation strategy
Our estimation of the longterm performance of the designed networks closely follows the earlystop approach of BlockQNNV1 (Zhong et al., 2018), but we ignore the penalization of the network’s FLOPs and density since we have empirically ascertained that it is too strict when the classification task is difficult (i.e., when low accuracy values are expected).
The choice of an earlystop strategy is made to help reduce the computational cost of our approach. In short, for every sampled architecture $\mathcal{N}$ a prediction module is appended, and the network is then trained for a small number of epochs to obtain its accuracy on a test set, which is the final estimation of its longterm performance. The datasets considered are balanced, and their train and test splits are designed beforehand (see Section 4.2.3).
As in BlockQNN, for the prediction module we stack a fullyconnected dense layer with 1024 units and ReLU activation function, a dropout layer with rate of 0.4, a dense layer with the number of units equals to the desired number of classes to predict and linear activation function, and a softmax that outputs the probabilities per class. The training is performed to minimize the crossentropy loss using the Adam Optimizer (Kingma and Ba, 2015) with the parameters used in BlockQNN: ${\beta}_{1}=0.9$, ${\beta}_{2}=0.999$, ${\u03f5}_{\mathrm{\text{adam}}}=10{e}^{8}$, and ${\alpha}_{\mathrm{\text{adam}}}=0.001$ that is reduced by a factor of 0.2 every five epochs. After training, the network is evaluated on a test set by fixing the network’s weights and selecting the class with the highest probability to be the final prediction per observation in the set so that the standard accuracy ${\text{ACC}}_{\mathcal{N}}$ can be returned.
4.2 The reinforcement learning framework
The deepmetaRL framework that we propose is different from standard RL in two main aspects. First, the agent is challenged to face more than one environment during training, and second, the distribution over the reward domain learned by the agent is now dependant on the whole history of states, actions, and rewards, instead of the simple stateaction pairs. In the remainder of the section, we describe each of the RL elements.
4.2.1 The states
A state ${x}_{i}\in \mathcal{X}$ is a multidimensional array of size $d\times 5$, storing $d$ NSC vectors sorted by layer index. While this representation is programmatically easy to control, it is not ideal in a machine learning setting. In particular, we note that every element of an NSC vector is a categorical variable. Therefore, when required, every NSC vector in ${x}_{t}$ is transformed as follows: the layer’s type^{6}^{6} 6 This size is the result of having 7 types of layers (see Section 4.1.1) plus the type 0 representing an empty layer. is encoded into a onehot vector of size 8, the predecessors into a onehot vector of size $(d+1)$, and the kernel size into a onehot vector of size $(k+1)$ with $k=\mathrm{max}(\text{kernel\_size})$. The transformation ignores the layer index because the state implicitly incorporates the information of the position of each layer due to sorting. This encoding results in a multidimensional array^{7}^{7} 7 When working with chainstructured networks the second predecessor is always omitted, reducing the dimensionality of the encoding to $d\times (d+k+10)$. of size $d\times (2d+k+11)$. Figure 5 illustrates this transformation.
4.2.2 The action space
We formulate the action space $\mathcal{A}$ as a discrete space of 14 actions listed in Table 2. Each action ${a}_{i}\in \mathcal{A}$ can either append a new element from the NSC space to a state ${x}_{j}\in \mathcal{X}$ or control two pointers, ${p}_{1}$ and ${p}_{2}$, for the indices of the predecessors to use for the next NSC vector. We note that for the chainedstructured networks no pointers are required since the predecessor is always the previous layer, and neither do the merging operations addition and concatenation, making it possible to reduce the action space to 8 actions only.
Action ID  Description 

A0  Add convolution with $\text{kernel\_size}=1$, using predecessor ${p}_{1}$ 
A1  Add convolution with $\text{kernel\_size}=3$, using predecessor ${p}_{1}$ 
A2  Add convolution with $\text{kernel\_size}=5$, using predecessor ${p}_{1}$ 
A3  Add maxpooling with $\text{pool\_size}=2$, using predecessor ${p}_{1}$ 
A4  Add maxpooling with $\text{pool\_size}=3$, using predecessor ${p}_{1}$ 
A5  Add avgpooling with $\text{pool\_size}=2$, using predecessor ${p}_{1}$ 
A6  Add avgpooling with $\text{pool\_size}=3$, using predecessor ${p}_{1}$ 
A7  Add terminal state. 
A8  Add addition with predecessors ${p}_{1}$ and ${p}_{2}$ 
A9  Add concatenation with predecessors ${p}_{1}$ and ${p}_{2}$ 
A10  Shift ${p}_{1}$ one position up (i.e., ${p}_{1}={p}_{1}+1$) 
A11  Shift ${p}_{1}$ one position down (i.e., ${p}_{1}={p}_{1}1$) 
A12  Shift ${p}_{2}$ one position up (i.e., ${p}_{2}={p}_{2}+1$) 
A13  Shift ${p}_{2}$ one position down (i.e., ${p}_{2}={p}_{2}1$) 
4.2.3 The environments
In our work, an environment is a neural architecture design task for image classification on a specific dataset of interest. The goal for an agent on this environment is to come up with the best architecture possible after interacting for a certain number of timesteps. At any timestep $t$, the environment’s state is ${x}_{t}\in \mathcal{X}$, which is the NSC representation of a neural network ${N}_{t}$. The reward ${r}_{t}\in [0,1]$ associated with ${x}_{t}$ is a function of the network’s accuracy ${\text{ACC}}_{{N}_{t}}\in [0,1]$ (Section 4.1.2). The initial state ${x}_{0}$ of the environment is an empty architecture.
An agent can interact with the environment through a set of episodes by performing actions ${a}_{t}\in \mathcal{A}$. In our terminology, an episode is the trajectory from a reset of the environment’s state until a termination signal. The environment triggers the termination signal in the following cases: a) the predecessors ${p}_{1}$ and ${p}_{2}$ are out of bounds after the execution of ${a}_{t}$, b) ${a}_{t}$ is a terminal action, c) ${x}_{t}$ contains $d$ NSC elements (the maximum depth) after performing ${a}_{t}$, d) the total number of actions executed in the current episode is higher than a given number $\tau $, or e) the action led to an invalid architecture. The agentenvironment interaction process is formalized in Algorithm 4.2.3.
{algorithmic}[1] \ProcedureinteractAgent, Environment, Dataset, ${t}_{max}$, $\sigma $ \State$done\leftarrow \text{False}$ \State$t\leftarrow 0$ \StateEnvironment.reset_to_initial_state() \While$$ \State${a}_{t}\leftarrow \text{Agent.get\_next\_action()}$ \State${x}_{t}\leftarrow \text{Environment.update\_state(}{a}_{t}\text{)}$ \State$N\leftarrow \text{Environment.build\_network(}{x}_{t}\text{)}$ \State${\text{ACC}}_{{N}_{t}}\leftarrow \text{N.accuracy(Dataset)}$ \If${a}_{t}$ is shifting \State${r}_{t}\leftarrow \sigma \cdot {\text{ACC}}_{{N}_{t}}$ \Else\State${r}_{t}\leftarrow {\text{ACC}}_{{N}_{t}}$ \EndIf
$done\leftarrow \text{Environment.is\_termination()}$
$\text{Agent.learn(}{x}_{t},{a}_{t},{r}_{t},done\text{)}$
$done$ \StateEnvironment.reset_to_initial_state() \State$done\leftarrow \text{False}$ \EndIf\EndWhile\EndProcedure
As mentioned in the beginning of Section 4, we work with more than one environment. Specifically, we define five environments, each one associated with a different dataset sampled from the metadataset collection (Triantafillou et al., 2019). The datasets are listed in Table 3 and they were selected as explained in Appendix A. All datasets have balanced classes. In order to evaluate the accuracy of a network ${N}_{t}$, for any dataset we perform a deterministic 1/3 traintest split and follow the preprocessing that has been initially proposed by the metadataset authors so that the images are resized to a shape of $84\times 84\times 3$ using bilinear interpolation.
Dataset ID  Dataset name  Usage  N classes  N observations 

aircraft  FGVCAircraft  Validation  100  10000 
cu_birds  CUB2002011  Validation  200  11788 
dtd  Describable Textures  Train  47  5640 
omniglot  Omniglot  Train  1623  32460 
vgg_flower  VGG Flower  Train  102  8189 
.
4.2.4 Deep metareinforcement learning
Our deep metaRL approach, illustrated in Figure 6, is based on the work of Wang et al. (2016) and Duan et al. (2016). They propose to learn a policy that, in addition to the stateaction pairs of standard RL, uses the current timestep in the agentenvironment interaction (the temporal information) as well as the previous action and reward. In this way, the agent can learn the relation between its past decisions and the current action. However, we introduce a modification in the temporal information, by considering the relative step within an episode instead of the global timestep so that the agent can capture the relation between changes in a neural architecture.
Formally, let $\mathcal{D}$ be a set of Markov Decision Processes (MDPs). Consider an agent embedding a Recurrent Neural Network (RNN)  with internal state $h$  modeling a policy $\pi $. At the start of a trial, a new task ${m}_{i}\in \mathcal{D}$ is sampled, and the internal state $h$ is set to zeros (empty network). The agent then executes its actionselection strategy for a certain number ${t}_{max}$ of discrete timesteps, performing $n$ episodes of maximum length $l$ depending on the environment’s rules. At each step $t$ (with $0\le t\le {t}_{max}$) an action ${a}_{t}\in A$ is executed as a function of the observed history ${H}_{t}=\{{x}_{0},{a}_{0},{r}_{0},{c}_{0},\mathrm{\dots},{x}_{t1},{a}_{t1},{r}_{t1},{c}_{t1},{x}_{t}\}$ (set of states ${\{{x}_{s}\}}_{0\le s\le t}$, actions $$, rewards $$, episoderelated steps ${\{{c}_{s}\}}_{0\le s\le l}$) and a reward ${r}_{t}$ is obtained. At the very beginning of the trial, the action ${a}_{0}$ is sampled at random from a uniform distribution of all actions available, and the state ${x}_{0}$ is given by the environment’s rules. The RNN’s weights are trained to maximize the total discounted reward accumulated during each trial. The evaluation consists of resetting $h$ and fixing $\pi $ to run an interaction with a new MDP ${m}_{e}\notin \mathcal{D}$.
4.2.5 The policy optimization algorithm
Similarly to Wang et al. (2016), we make use of the synchronous Advantage ActorCritic (A2C) (Mnih et al., 2016) with one worker. As it can be observed in Figure 7, the only change in the A2C network is in the input of the recurrent unit, so that the updates of the network’s parameters remain unchanged.
Formally, let $t$ be the current time step, ${s}_{t}={x}_{t}\cdot {a}_{t1}\cdot {r}_{t1}\cdot {c}_{t}$ a concatenation of inputs, $\pi ({a}_{t}{s}_{t};\theta )$ the policy, $V({s}_{t};{\theta}_{v})$ the value function, $H$ the entropy, $j\in \mathbb{N}$ the horizon, $\gamma \in (0,1]$ the discount factor, $\eta $ the regularization coefficient, and ${R}_{t}={\sum}_{i=0}^{j1}{\gamma}^{i}{r}_{t+i}$ the total accumulated return from time step $t$. The gradient of the objective function is:
$${\nabla}_{\theta}\mathrm{log}\pi ({a}_{t}{s}_{t};\theta )\underset{\text{Advantage estimate}}{\underset{\u23df}{({R}_{t}V({s}_{t};{\theta}_{v}))}}+\underset{\text{Entropy regularization}}{\underset{\u23df}{\eta {\nabla}_{\theta}H(\pi ({s}_{t};\theta ))}}$$  (1) 
As it is usually the case for A2C, the parameters $\theta $ and ${\theta}_{v}$ are shared except for the ones in output layers. For a detailed description of the algorithm, we refer to the original paper (Mnih et al., 2016).
5 Evaluation framework
The current Neural Architecture Search (NAS) solutions lack a crucial element: an opensource framework for reproducibility and further research. Specifically for NAS with reinforcement learning, it would be desirable to build on a programming interface that allows researchers to explore the effect of different algorithms on the same NAS environment. In an attempt to fill this gap, we have developed the nasgym^{8}^{8} 8 Source code available at: github.com/gomerudo/nasenv, a python OpenAI Gym (Brockman et al., 2016) environment that can jointly be used with all the reinforcement learning algorithms exposed in the OpenAI baselines (Dhariwal et al., 2017).
We make use of the objectoriented paradigm to abstract the most essential elements of NAS as a reinforcement learning problem, resulting in a system that can be extended to perform new experiments, as displayed in Figure 8. Although the defaults in the nasgym are the elements in our methodology, the system allows us to easily modify the key components, such as the performance estimation strategy, the action space, or the Neural Structure Code space. We also provide an interface to use a database of experiments that can help to store previously computed rewards, thus reducing the computation time of future trials. All the deep learning components are built with TensorFlow v1.12 (Abadi et al., 2015).
Additionally to the nasgym, we implement the meta version of the A2C algorithm on top of the OpenAI baselines^{9}^{9} 9 Source code available at: github.com/gomerudo/openaibaselines. We believe that this software engineering effort will help to compare, reproduce, and develop future research in NAS.
6 Experiments
To evaluate our framework, we conduct three experiments. The first two aim to study the behavior of the agent when challenged to design chainstructured networks, and the third one is intended to observe its behavior in the multibranch setting. We empirically assess the quality of the networks designed by the agent through episodes, the ability of the agent to adapt to each environment, and the runtimes of the training trials.
6.1 Chainstructured networks
Experiment 1: evolution during training. The agent learns from the three train environments listed in Table 3, using deep metaRL. It starts in the omniglot environment, continues in vgg_flower, and finishes in dtd so that it faces increasingly harder classification tasks (see Appendix A), and the policy learned in one environment is reused in the next one. The agent interacts with each environment for ${t}_{max}=8000$, ${t}_{max}=7000$, and ${t}_{max}=7000$, respectively so that the agent spends more time in the first environment to develop its initial knowledge. We compare against two baselines: random search and DeepQN with experience replay, where the agent learns a new policy on each environment (i.e., it does not reuse the policy between trials) for ${t}_{max}=6500,5500,\text{and}7000$, respectively. Due to resources and time constraints, all ${t}_{max}$ values were empirically selected according to the behaviour of the rewards (see Section 7). The most relevant hyperparemeters are set as follows:


Environment


$d=10$. The maximum depth of a neural architecture.


$\tau =10$. The maximum length of an episode.




A2C hyperparameters


$j=5$. The number of steps to perform before updating the A2C parameters (see Equation 4.2.5). We set the value to half the maximum depth of the networks to allow the agent to learn before the termination of an episode.


$\gamma =0.9$. The discount factor for the past actions.


$\eta =0.01$. The default in the OpenAI baselines (Dhariwal et al., 2017).


$\alpha =0.001$. The A2C learning rate set as in Learning to reinforcement learn (Wang et al., 2016).




DeepQN


Experience $\text{buffer size}=\frac{{t}_{max}}{2}$.


Target model’s $\text{batch size}=20$.


$\u03f5$ with linear decay from 1.0 to 0.1. The parameter controlling the exploration of the agent.


$\alpha =0.0005$. The default learning rate in the OpenAI baselines (Dhariwal et al., 2017).




Training of the sampled networks


$\text{batch size}=128$.


$\text{epochs}=12$. The value used in BlockQNN (Zhong et al., 2018).


Experiment 2: evaluation of the policy. We fix the policy obtained in Experiment 1. The agent interacts with the evaluation environments, aircraft and cu_birds, and deploys its decisionmaking strategy to design a neural architecture for each dataset. The interaction runs for ${t}_{max}=2000$ to study the performance of the policy in short evaluation trials. At the end of the interaction, we select the best two architectures per environment (i.e., the ones with the highest reward) and train them on the same datasets but applying a more intensive procedure as follows. First, we augment the capacity of the architectures by changing the number of filters in the convolution layers according to the layer’s depth; i.e., $\text{number of units}={2}^{4+i}$ with $i$ being the current count of convolutions while building the network (e.g., $\text{number of units}=32\to 64\to 128$). Second, we stack the prediction module as described in Section 4.1.2, but we increase the number of units in the first dense layer to 4096, we use a learning rate with exponential decay, and we train the network for 100 epochs. Since the datasets that we use are resized to a shape of 84x84x3, it is not fair to compare our resulting accuracy values with those of stateoftheart architectures that assume a higher order of shape (Cui et al., 2018; Guo and Farrell, 2018; Hu and Qi, 2019), and neither is to train our networks (which are designed for a given input size) with bigger images. Hence, based on the baselines of Hu and Qi (2019), we use a VGG19 network (Liu and Deng, 2015) with only two blocks as our baseline on both datasets.
6.2 Multibranch networks
Experiment 3: training on a more complex environment. In this experiment, we extend the search space to multibranch architectures. We consider the omniglot environment only. The goal here is to observe the ability of the agent to design multibranch networks through time; i.e., the number of multibranch structures generated through training. The interaction runs for ${t}_{max}=20000$ timesteps because more exploration is required due to the larger action space. The hyperparameters are the same as in Experiment 1, except that $\tau =20$ and $j=10$ because the trajectories are longer due to the shifting of the pointers controlling the predecessors, and $\text{batch size}=64$ because the concatenation operation can generate networks that require more space in memory. We train the agent from scratch two times varying the parameter $\sigma \in [0.0,0.1]$ (see Section 4.2.3) to study its effect encouraging the generation of multibranch structures.
7 Results
Experiment 1: evolution during training
Figure 9 shows the evolution of the best reward and the accumulated reward per episode (representing the quality of the neural architectures), as well as the episode length (in a chainstructured network this represents the number of layers). We observe that, in the first environment (omniglot), our deep metaRL agent performs worse than DeepQN. Nevertheless, in the second and third environments (vgg_flower, dtd), the agent performs better than the baselines from the very first steps, and more consistently through all episodes (showing less variance). DeepQN only catches up after many more episodes, although it exhibits a faster learning curve, which we ascribe to the linear exploration that makes it to end exploration sooner.
Figure 10 shows the entropy of the policy during training over the different environments, which in the A2C algorithm is related to the level of exploration by the agent (more exploration leads to high entropy). In the first environment, our agent explores the environment for a significant number of timesteps, which translates to the slow increase observed in Figure 8(a). In the second environment, the exploration drops down quickly, except for a short period with increased exploration (timesteps 9005 to 12005). In the last and hardest environment, the agent reexplores the environment to adapt its strategy, leading to a reduction of the episode length (depth of the networks) and, consequently, the accumulated reward does not appear to increase due to the shorter episodes. We believe that exploration causes the drops in episode length in vgg_flower and dtd (Figures 8(e) and 8(h)).
In Figure 11, the proportion of the actions performed by the agent during training is shown. We see that it deployed different strategies per environment. Specifically, we note the changes in proportion for actions A0 (convolution with a kernel of size 1), A3 (maxpooling with pool size of 2), and A7 (the terminal state) when the environment switched from vgg_flower to dtd, suggesting that the agent preferred different layers and depth according to the dataset.
Finally, Table 4 shows the running times per environment for each RL algorithm. Here, we do not observe significant differences considering that once transferred, the policy of the deep metaRL agent designs deeper and more costly networks, as observed in Figures 8(e) and 8(h).
Dataset  Deep metaRL  DQN 

omniglot  11 days 9h  6 days 14h 
vgg_flower  7 days 23h  5 days 15h 
dtd  6 days 17h  6 days 4h 
Total  26 days 1h  18 days 9h 
Experiment 2: evaluation of the policy
The results of replaying the learned policy on completely new datasets are displayed in Figure 12, and the corresponding runtimes are listed in Table 5. They show that the agent immediately finds a good solution (with a deep network), and rewards remain consistent; however, it does not improve over time, which warrants further study (see Section 8). Moreover, the strategy deployed by the agent is not different on each dataset, as it is observed in Figure 13. We confirm that the strategies are not significantly different by performing a Wilcoxon signedrank test with the null hypothesis that the two related paired samples come from the same distribution. The output is a $\text{pvalue}=0.48$ with 95% confidence.
Dataset  Runtime 

aircraft  2 days 6h 
cu_birds  2 days 22h 
Total  5 days 4h 
As we mentioned in Section 6, another result of interest is the performance of the best networks designed by the agent when they follow more intensive training. Table 6 shows the accuracy values obtained. We note that the networks achieve low accuracy, in the majority of the cases worse than random guessing. An important observation is that these low values can be a consequence of the relaxation made in the shape of the images. Whereas stateoftheart architectures on both aircraft and cu_birds work with shapes greater than $200\times 200$, we use a smaller version of $84\times 84$ that might lead to loss of information. Moreover, stateoftheart results for these datasets are usually obtained after data augmentation and use deeper and more complex networks with multibranch structures (Cui et al., 2018; Guo and Farrell, 2018; Hu and Qi, 2019). However, in this experiment, we do not consider any of the latter aspects since we work under resource constraints that force us to make relaxations, and thus a lower accuracy can be expected.
Despite the low values, the architectures for the two datasets designed by the deep metaRL agent outperformed by a significant amount the shortened version of VGG19. This shows that by using the learned policy it is possible to find better architectures than one inspired by stateoftheart networks. A final observation is that the best architecture found by the agent during training did not become the best final network, thus exhibiting that earlystop can underestimate the longterm performance of the networks, which also warrants future work.
Dataset  Deep metaRL (1st)  Deep metaRL (2nd)  VGG19like 

aircraft  49.18 $\pm $ 1.2  50.11 $\mathrm{\pm}$ 1.02  30.85 $\pm $ 10.82 
cu_birds  23.97 $\pm $ 1.28  24.24 $\mathrm{\pm}$ 0.90  6.66 $\pm $ 1.98 
Experiment 3: training on a more complex environment
Figure 14 shows the evolution of the best reward, episode length, and accumulated reward during the multibranch experiment on omniglot. We do not observe differences in the behavior of the agent when using different $\sigma $ values, but we note that it took longer to output meaningful rewards (around episode 3000) when compared to Experiment 1, causing extended runtimes as shown in Table 7.
Runtime  
$\sigma =0.0$  13 days 9h 
$\sigma =0.1$  15 days 14h 
Total  28 days 23h 
However, we stated in Section 6.2 that our main interest in this experiment is to study whether or not the agent can explore multibranch structures. Figure 14(a) shows the entropy of the policy through timesteps, and Figure 14(b) the count of multibranch structures through episodes. We note that during exploration the appearance of multibranch structures is more likely, and after episode 3000 (represented by the vertical line in Figure 14(a)), when exploration drops down, the multibranch structures become less frequent. Furthermore, we found that the proportion of multibranch vs. chainstructured networks is only 1:10, meaning that the agent did not explore multibranch structures aggressively, and settled for chainstructured networks instead. The latter is supported by the proportion of actions displayed in Figure 16, where the actions A813 (related to multibranch structures) are the least frequent.
We believe that a multibranch space requires us to handle differently how the predecessors of the NSC vectors are selected (see Section 4.2.2). Some alternatives are: defining heuristics to chose the predecessors instead of using the shifting operations, assigning other rewards to the actions related to the predecessors, and modifying the hyperparameters of the A2C network to encourage more exploration of the agent in the beginning of the deep metaRL training.
8 Conclusions and future work
In this work, we presented the first application of deep metaRL in the NAS setting. Firstly, we investigated the advantages of deep metaRL against standard RL on the relatively simple scenario of chainstructured architectures. Despite resource limitations (1 GPU only), we observed that a policy learned using deep metaRL can be transferred to other environments and quickly designs architectures with higher and more consistent accuracy than standard RL. Nevertheless, standard RL outperforms metaRL when both learn a policy from scratch. We also note that the metaRL agent exhibited adaptive behavior during the training, changing its strategy according to the dataset in question. Secondly, we analyzed the adaptability of the agent during evaluation (i.e., when the policy’s weights are fixed) and the quality of the networks that it designs for previously unseen datasets. In our experiment, the agent was not able to adapt its strategy to different environments, but the performance of the networks delivered was better than the performance of a humandesigned network, showing that the knowledge developed by our agent in the training environments is meaningful in others. Thirdly, we extended our approach to a more complex NAS scenario with a multibranch search space. In this setting, the metaRL agent was not able to deeply explore the multibranch structures and settled for chainstructured networks instead.
We conclude that deep metaRL does provide an advantage over standard RL when transferring is enabled, and it can effectively adapt its strategy to different environments during training. Moreover, the policy learned can be used to deliver meaningful and wellperforming architectures on previously unseen environments without further training. We believe that it is possible to strengthen our deepmeta RL framework in future work. Specifically, we propose to investigate the following aspects under more powerful computational resources:


Hyperparameter tunning of the A2C components. In Experiment 1, we observed that the learning progress of the metaRL agent is slow. We also noticed a long exploration window in the first environment. In order to improve these aspects, we propose to tune the hyperparameters according to the next intuitions:


$j$: the parameter controlling the number of steps before a learning update. We suggest reducing this value to speed up learning.


$\eta $: the entropy regularization. Experiments varying the range of this hyperparameter are required to observe its impact on the learning curve. Also, different values could be used depending on the hardness of the environments.


$\alpha $: the learning rate. We suggest exploring decay functions for the learning rate to encourage faster learning after exploration.




Duration of the agentenvironment interaction. In Experiment 2, the policy did not exhibit adaptive behavior. A possible explanation is that the training trials were relatively short when compared to other reinforcement learning applications. Training the agent for longer trials could help improve the adaptation of the policy during evaluation.


The action space in the multibranch setting. In Experiment 3, we observed that the agent was not able to explore the multibranch space sufficiently and settled for chainstructured networks instead. Although hyperparameter tuning could also help encourage exploration of the multibranching actions, we believe that redefining the actions is a more suitable area of improvement. In that line, we recommend exploring heuristics based on the number of connections to select the predecessors.


The datasets and the performance estimation strategy. In all experiments, we observed a low accuracy of the networks in the datasets. Since we worked with constrained resources, we applied relaxations to the datasets and the performance estimation strategy to reduce the computational cost, which could have affected the accuracy of the networks. Future work can focus on designing a different set of environments with images with a smaller size, optimizing the performance estimation strategy per dataset, and investigating alternatives to reduce the cost of computing the rewards associated to the networks.


Transforming other standard RL algorithms to a metaRL version. The transformation of the A2C algorithm to a metaRL implementation required to change the input passed to the policy and to rely on a recurrent unit to learn the temporal dependencies between actions. This transformation is possible on other standard RL algorithms, which would help study different metaRL approaches to NAS.


Benchmarking of other RL on the same NAS environments. In Section 5, we introduced the system developed to conduct our experiments, which allows to easily play other RL algorithms from the OpenAI baselines on the same NAS environments. We believe that this system will help to encourage research in these directions so that the benefits of different RL algorithms on NAS can be studied in detail.
Acknowledgments
The authors thank the SURF cooperative for kindly providing the required computational resources.
References
 Abadi et al. (2015) Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: Largescale machine learning on heterogeneous systems, 2015. URL http://tensorflow.org/. Software available from tensorflow.org.
 Arel et al. (2010) I. Arel, C. Liu, T. Urbanik, and A. G. Kohls. Reinforcement learningbased multiagent system for network traffic signal control. IET Intelligent Transport Systems, 4(2):128–135, June 2010. ISSN 1751956X. doi: 10.1049/ietits.2009.0070.
 Baker et al. (2016) Bowen Baker, Otkrist Gupta, Nikhil Naik, and Ramesh Raskar. Designing neural network architectures using reinforcement learning. CoRR, abs/1611.02167, 2016. URL http://arxiv.org/abs/1611.02167.
 Bouzy and Chaslot (2006) B. Bouzy and G. Chaslot. Montecarlo go reinforcement learning experiments. In 2006 IEEE Symposium on Computational Intelligence and Games, pages 187–194, May 2006. doi: 10.1109/CIG.2006.311699.
 Brockman et al. (2016) Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. Openai gym, 2016.
 Cai et al. (2018) Han Cai, Jiacheng Yang, Weinan Zhang, Song Han, and Yong Yu. Pathlevel network transformation for efficient architecture search. CoRR, abs/1806.02639, 2018. URL http://arxiv.org/abs/1806.02639.
 Cui et al. (2018) Yin Cui, Yang Song, Chen Sun, Andrew Howard, and Serge J. Belongie. Large scale finegrained categorization and domainspecific transfer learning. CoRR, abs/1806.06193, 2018. URL http://arxiv.org/abs/1806.06193.
 Dhariwal et al. (2017) Prafulla Dhariwal, Christopher Hesse, Oleg Klimov, Alex Nichol, Matthias Plappert, Alec Radford, John Schulman, Szymon Sidor, Yuhuai Wu, and Peter Zhokhov. Openai baselines. https://github.com/openai/baselines, 2017.
 Duan et al. (2016) Yan Duan, John Schulman, Xi Chen, Peter L. Bartlett, Ilya Sutskever, and Pieter Abbeel. Rl${}^{2}$: Fast reinforcement learning via slow reinforcement learning. CoRR, abs/1611.02779, 2016. URL http://arxiv.org/abs/1611.02779.
 Elsken et al. (2018) Thomas Elsken, Jan Hendrik Metzen, and Frank Hutter. Neural Architecture Search: A Survey. arXiv eprints, art. arXiv:1808.05377, Aug 2018.
 Finn et al. (2017) Chelsea Finn, Pieter Abbeel, and Sergey Levine. Modelagnostic metalearning for fast adaptation of deep networks. CoRR, abs/1703.03400, 2017. URL http://arxiv.org/abs/1703.03400.
 Guo and Farrell (2018) Pei Guo and Ryan Farrell. Finegrained visual categorization using PAIRS: pose and appearance integration for recognizing subcategories. CoRR, abs/1801.09057, 2018. URL http://arxiv.org/abs/1801.09057.
 He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. CoRR, abs/1603.05027, 2016. URL http://arxiv.org/abs/1603.05027.
 Hu and Qi (2019) Tao Hu and Honggang Qi. See better before looking closer: Weakly supervised data augmentation network for finegrained visual classification. CoRR, abs/1901.09891, 2019. URL http://arxiv.org/abs/1901.09891.
 Kingma and Ba (2015) Diederik P Kingma and Jimmy Lei Ba. Adam: A method for stochastic gradient descent. ICLR: International Conference on Learning Representations, 2015.
 Kober et al. (2013) Jens Kober, J. Andrew Bagnell, and Jan Peters. Reinforcement learning in robotics: A survey. Int. J. Rob. Res., 32(11):1238–1274, September 2013. ISSN 02783649. doi: 10.1177/0278364913495721. URL http://dx.doi.org/10.1177/0278364913495721.
 Konda (2002) Vijaymohan Konda. Actorcritic Algorithms. PhD thesis, Cambridge, MA, USA, 2002. AAI0804543.
 Liu et al. (2018) Hanxiao Liu, Karen Simonyan, and Yiming Yang. DARTS: differentiable architecture search. CoRR, abs/1806.09055, 2018. URL http://arxiv.org/abs/1806.09055.
 Liu and Deng (2015) S. Liu and W. Deng. Very deep convolutional neural network based image classification using small training sample size. In 2015 3rd IAPR Asian Conference on Pattern Recognition (ACPR), pages 730–734, Nov 2015. doi: 10.1109/ACPR.2015.7486599.
 Mao et al. (2016) Hongzi Mao, Mohammad Alizadeh, Ishai Menache, and Srikanth Kandula. Resource management with deep reinforcement learning. In Proceedings of the 15th ACM Workshop on Hot Topics in Networks, HotNets ’16, pages 50–56, New York, NY, USA, 2016. ACM. ISBN 9781450346610. doi: 10.1145/3005745.3005750. URL http://doi.acm.org/10.1145/3005745.3005750.
 Mnih et al. (2013) Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin A. Riedmiller. Playing atari with deep reinforcement learning. CoRR, abs/1312.5602, 2013. URL http://arxiv.org/abs/1312.5602.
 Mnih et al. (2016) Volodymyr Mnih, Adrià Puigdomènech Badia, Mehdi Mirza, Alex Graves, Timothy P. Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. CoRR, abs/1602.01783, 2016. URL http://arxiv.org/abs/1602.01783.
 Pham et al. (2018) Hieu Pham, Melody Y. Guan, Barret Zoph, Quoc V. Le, and Jeff Dean. Efficient neural architecture search via parameter sharing. CoRR, abs/1802.03268, 2018. URL http://arxiv.org/abs/1802.03268.
 Real et al. (2018) Esteban Real, Alok Aggarwal, Yanping Huang, and Quoc V. Le. Regularized evolution for image classifier architecture search. CoRR, abs/1802.01548, 2018. URL http://arxiv.org/abs/1802.01548.
 Schulman et al. (2015) John Schulman, Sergey Levine, Philipp Moritz, Michael I. Jordan, and Pieter Abbeel. Trust region policy optimization. CoRR, abs/1502.05477, 2015. URL http://arxiv.org/abs/1502.05477.
 Schulman et al. (2017) John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. CoRR, abs/1707.06347, 2017. URL http://arxiv.org/abs/1707.06347.
 Singh et al. (2019) Prabhant Singh, Tobias Jacobs, Sebastien Nicolas, and Mischa Schmidt. A Study of the Learning Progress in Neural Architecture Search Techniques. arXiv eprints, art. arXiv:1906.07590, Jun 2019.
 Sutton and Barto (2012) Richard S. Sutton and Andrew G. Barto. Reinforcement learning: an introduction. The MIT Press, 2012.
 Triantafillou et al. (2019) Eleni Triantafillou, Tyler Zhu, Vincent Dumoulin, Pascal Lamblin, Kelvin Xu, Ross Goroshin, Carles Gelada, Kevin Swersky, PierreAntoine Manzagol, and Hugo Larochelle. Metadataset: A dataset of datasets for learning to learn from few examples. CoRR, abs/1903.03096, 2019. URL http://arxiv.org/abs/1903.03096.
 Wang et al. (2016) Jane X. Wang, Zeb KurthNelson, Dhruva Tirumala, Hubert Soyer, Joel Z. Leibo, Rémi Munos, Charles Blundell, Dharshan Kumaran, and Matthew Botvinick. Learning to reinforcement learn. CoRR, abs/1611.05763, 2016. URL http://arxiv.org/abs/1611.05763.
 Watkins and Dayan (1992) Christopher J. C. H. Watkins and Peter Dayan. Qlearning. Machine Learning, 8(3):279–292, May 1992. ISSN 15730565. doi: 10.1007/BF00992698. URL https://doi.org/10.1007/BF00992698.
 Williams (1992) Ronald J. Williams. Simple statistical gradientfollowing algorithms for connectionist reinforcement learning. In Machine Learning, pages 229–256, 1992.
 Zhong et al. (2018) Zhao Zhong, Zichen Yang, Boyang Deng, Junjie Yan, Wei Wu, Jing Shao, and ChengLin Liu. Blockqnn: Efficient blockwise neural network architecture generation. CoRR, abs/1808.05584, 2018. URL http://arxiv.org/abs/1808.05584.
 Zoph and Le (2016) Barret Zoph and Quoc V. Le. Neural architecture search with reinforcement learning. CoRR, abs/1611.01578, 2016. URL http://arxiv.org/abs/1611.01578.
 Zoph et al. (2017) Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V. Le. Learning transferable architectures for scalable image recognition. CoRR, abs/1707.07012, 2017. URL http://arxiv.org/abs/1707.07012.
A Selection of the datasets
The deep metareinforcement learning framework that we implement requires a set of environments associated to image classification tasks. In order to design these environments, we rely on the metadataset (Triantafillou et al., 2019), a collection of 10 datasets with a concrete sampling procedure designed for metalearning in fewshot learning image classification. In our setting, the datasets are intended for standard image classification, thus we redefine the sampling strategy. Our interest is in using small but yet challenging datasets that allow us to save computational resources without making the Neural Architecture Search (NAS) trivial.
In Table 8 the original datasets in the collection are listed. We select the ones that are smaller than CIFAR10 (60K observations), which is the reference for NAS. The datasets satisfying the criterion are aircraft, cu_birds, dtd, omniglot, traffic_sign and vgg_flower. We want to evaluate the hardness of these six datasets to define a sampling procedure from the collection, and thus we perform a short and individual deep metareinforcement learning trial with ${t}_{max}=200$ for each dataset. Since at the beginning of the trial the agent does not develop any significant knowledge, its sampling of architectures is random. In Figure 17 the boxplot and barplot of the obtained accuracy values are presented, and in Table 9 the running time per experiment is shown.
A simple exploratory analysis suggests three types of datasets: a “trivial” dataset with high accuracy values with simple networks (traffic_sign), two “hard” datasets with low accuracy values (all values below 30%: dtd and cu_birds), and three “medium” datasets with more diversity of accuracy values (median around 30% and broader interquartile range: aircraft, omniglot, vgg_flower). On the other hand, for the running times, we can observe that aircraft and cu_birds result in the most expensive runs. Considering the computation time, and the hardness of the classification tasks, we defined the sampling presented in Table 3. Our training datasets have different levels of hardness and reported the least costly runs.
Dataset ID  Dataset name  N classes  N observations 

aircraft  FGVCAircraft  100  10000 
cu_birds  CUB2002011  200  11788 
dtd  Describable Textures  47  5640 
fungi  FGVCx Fungi  1394  89760 
ilsvrc_2012  ImageNet  1000  1280764 
mscoco  Common Objects in Context  80  330000 
omniglot  Omniglot  1623  32460 
quickdraw  Quick, Draw!  345  50426266 
traffic_sign  German Traffic Sign Recognition Benchmark  43  39209 
vgg_flower  VGG Flower  102  8189 
Dataset ID  Time 

aircraft  9h49m 
cu_birds  16h20m 
dtd  5h38m 
omniglot  3h38m 
traffic_sign  4h33m 
vgg_flower  4h56m 
B Networks designed by the deep metaRL agent during training and evaluation
Here we show the best architectures designed by the agent in the three experiments. Figure 18 shows the best architecture per datasets during training (omniglot, vgg_flower, and dtd). Figure 19 and 20 show the best two architectures during evaluation for aircraft and cu_birds respectively. Figure 21 shows the best architectures for the multibranch experiment. For each architecture we report the earlystop accuracy obtained.