Towards modular and programmable architecture search

  • 2019-09-30 00:18:56
  • Renato Negrinho, Darshan Patil, Nghia Le, Daniel Ferreira, Matthew Gormley, Geoffrey Gordon
  • 4

Abstract

Neural architecture search methods are able to find high performance deeplearning architectures with minimal effort from an expert. However, currentsystems focus on specific use-cases (e.g. convolutional image classifiers andrecurrent language models), making them unsuitable for general use-cases thatan expert might wish to write. Hyperparameter optimization systems aregeneral-purpose but lack the constructs needed for easy application toarchitecture search. In this work, we propose a formal language for encodingsearch spaces over general computational graphs. The language constructs allowus to write modular, composable, and reusable search space encodings and toreason about search space design. We use our language to encode search spacesfrom the architecture search literature. The language allows us to decouple theimplementations of the search space and the search algorithm, allowing us toexpose search spaces to search algorithms through a consistent interface. Ourexperiments show the ease with which we can experiment with differentcombinations of search spaces and search algorithms without having to implementeach combination from scratch. We release an implementation of our languagewith this paper.

 

Quick Read (beta)

Towards modular and programmable architecture search

Renato Negrinho1   Darshan Patil1   Nghia Le1   Daniel Ferreira2
Matthew Gormley1   Geoffrey Gordon1,3
Carnegie Mellon University1, TU Wien2, Microsoft Research Montreal3
Part of this work was done while the first author was a research scientist at Petuum.
Abstract

Neural architecture search methods are able to find high performance deep learning architectures with minimal effort from an expert [7]. However, current systems focus on specific use-cases (e.g. convolutional image classifiers and recurrent language models), making them unsuitable for general use-cases that an expert might wish to write. Hyperparameter optimization systems [4, 25, 8] are general-purpose but lack the constructs needed for easy application to architecture search. In this work, we propose a formal language for encoding search spaces over general computational graphs. The language constructs allow us to write modular, composable, and reusable search space encodings and to reason about search space design. We use our language to encode search spaces from the architecture search literature. The language allows us to decouple the implementations of the search space and the search algorithm, allowing us to expose search spaces to search algorithms through a consistent interface. Our experiments show the ease with which we can experiment with different combinations of search spaces and search algorithms without having to implement each combination from scratch. We release an implementation of our language with this paper11 1 Visit https://github.com/negrinho/deep_architect for code and documentation..

 

Towards modular and programmable architecture search


  Renato Negrinho1 thanks: Part of this work was done while the first author was a research scientist at Petuum.   Darshan Patil1   Nghia Le1   Daniel Ferreira2 Matthew Gormley1   Geoffrey Gordon1,3 Carnegie Mellon University1, TU Wien2, Microsoft Research Montreal3

\@float

noticebox[b]33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\[email protected]

1 Introduction

Architecture search has the potential to transform machine learning workflows. High performance deep learning architectures are often manually designed through a trial-and-error process that amounts to trying slight variations of known high performance architectures. Recently, architecture search techniques have shown tremendous potential by improving on handcrafted architectures, both by improving state-of-the-art performance and by finding better tradeoffs between computation and performance. Unfortunately, current systems fall short of providing strong support for general architecture search use-cases.

Hyperparameter optimization systems [4, 25, 8, 14] are not designed specifically for architecture search use-cases and therefore do not introduce constructs that allow experts to implement these use-cases efficiently, e.g., easily writing new search spaces over architectures. Using hyperparameter optimization systems for an architecture search use-case requires the expert to write the encoding for the search space over architectures as a conditional hyperparameter space and to write the mapping from hyperparameter values to the architecture to be evaluated. Hyperparameter optimization systems are completely agnostic that their hyperparameter spaces encode search spaces over architectures.

By contrast, architecture search systems [7] are in their infancy, being tied to specific use-cases (e.g., either reproducing results reported in a paper or concrete systems, e.g., for searching over Scikit-Learn pipelines [18]) and therefore lack support for general architecture search workflows. For example, current implementations of architecture search methods rely on ad-hoc encodings for search spaces, providing limited extensibility and programmability for new work to build on. For example, implementations of the search space and search algorithm are often intertwined, requiring substantial coding effort to try new search spaces or search algorithms.

Contributions

We describe a modular language for encoding search spaces over general computational graphs. We aim to improve the programmability, modularity, and reusability of architecture search systems. We are able to use the language constructs to encode search spaces in the literature. Furthermore, these constructs allow the expert to create new search spaces and modify existing ones in structured ways. Search spaces expressed in the language are exposed to search algorithms under a consistent interface, decoupling the implementations of search spaces and search algorithms. We showcase these functionalities by easily comparing search spaces and search algorithms from the architecture search literature. These properties will enable better architecture search research by making it easier to benchmark and reuse search algorithms and search spaces.

2 Related work

Hyperparameter optimization

Algorithms for hyperparameter optimization often focus on small or simple hyperparameter spaces (e.g., closed subsets of Euclidean space in low dimensions). Hyperparameters might be categorical (e.g., choice of regularizer) or continuous (e.g., learning rate and regularization constant). Gaussian process Bayesian optimization [24] and sequential model based optimization [10] are two popular approaches. Random search has been found to be competitive for hyperparameter optimization [3, 13]. Conditional hyperparameter spaces (i.e., where some hyperparameters may be available only for specific values of other hyperparameters) have also been considered [2, 5]. Hyperparameter optimization systems (e.g. Hyperopt [4], Spearmint [25], SMAC [14, 10] and BOHB [8]) are general-purpose and domain-independent. Yet, they rely on the expert to distill the problem into an hyperparameter space and write the mapping from hyperparameter values to implementations.

Architecture search

Contributions to architecture search often come in the form of search algorithms, evaluation strategies, and search spaces. Researchers have considered a variety of search algorithms, including reinforcement learning [28], evolutionary algorithms [23, 16], MCTS [17], SMBO  [17, 15], and Bayesian optimization [12]. Most search spaces have been proposed for recurrent or convolutional architectures [28, 23, 16] focusing on image classification (CIFAR-10) and language modeling (PTB). Architecture search encodes much of the architecture design in the search space (e.g., the connectivity structure of the computational graph, how many operations to use, their type, and values for specifying each operation chosen). However, the literature has yet to provide a consistent method for designing and encoding such search spaces. Systems such as Auto-Sklearn [9], TPOT [19], and Auto-Keras [11] have been developed for specific use-cases (e.g., Auto-Sklearn and TPOT focus on classification and regression of featurized vector data, Auto-Keras focus on image classification) and therefore support relatively rigid workflows. The lack of focus on extensibility and programmability makes these systems unsuitable as frameworks for general architecture search research.

3 Proposed approach: modular and programmable search spaces

To maximize the impact of architecture search research, it is fundamental to improve the programmability of architecture search tools22 2 cf. the effect of highly programmable deep learning frameworks on deep learning research and practice.. We move towards this goal by designing a language to write search spaces over computational graphs. We identify the following advantages for our language and search spaces encoded in it:

  • Similarity to computational graphs: Writing a search space in our language is similar to writing a fixed computational graph in an existing deep learning framework. The main difference is that nodes in the graph may be search spaces rather than fixed operations (e.g., see Figure 5). A search space maps to a single computational graph once all its hyperparameters have been assigned values (e.g., in frame d in Figure 5).

  • Modularity and reusability: The building blocks of our search spaces are modules and hyperparameters. Search spaces are created through the composition of modules and their interactions. Implementing a new module only requires dealing with aspects local to the module. Modules and hyperparameters can be reused across search spaces, and new search spaces can be written by combining existing search spaces. Furthermore, our language supports search spaces in general domains (e.g., deep learning architectures or Scikit-Learn [21] pipelines).

  • Laziness: A substitution module delays the creation of a subsearch space until all hyperparameters of the substitution module are assigned values. Experts can use substitution modules to encode natural and complex conditional constructions by concerning themselves only with the conditional branch that is chosen. This is simpler than the support for conditional hyperparameter spaces provided by hyperparameter optimization tools, e.g., in Hyperopt [4], where all conditional branches need to be written down explicitly. Our language allows conditional constructs to be expressed implicitly through composition of language constructs (e.g., nesting substitution modules). Laziness also allows us to encode search spaces that can expand infinitely, which is not possible with current hyperparameter optimization tools (see Appendix D.1).

  • Automatic compilation to runnable computational graphs: Once all choices in the search space are made, the single architecture corresponding to the terminal search space can be mapped to a runnable computational graph (see Algorithm 4). By contrast, for general hyperparameter optimization tools this mapping has to be written manually by the expert.

4 Components of the search space specification language

A search space is a graph (see Figure 5) consisting of hyperparameters (either of type independent or dependent) and modules (either of type basic or substitution). This section describes our language components and show encodings of simple search spaces in our Python implementation. Figure 5 and the corresponding search space encoding in Figure 4 are used as running examples. Appendix A and Appendix B provide additional details and examples, e.g. the recurrent cell search space of [22].

Independent hyperparameters

The value of an independent hyperparameter is chosen from its set of possible values. An independent hyperparameter is created with a set of possible values, but without a value assigned to it. Exposing search spaces to search algorithms relies mainly on iteration over and value assignment to independent hyperparameters. An independent hyperparameter in our implementation is instantiated as, for example, D([1, 2, 4, 8]). In Figure 5, IH-1 has set of possible values {64,128} and is eventually assigned value 64 (shown in frame d).

Dependent hyperparameters

The value of a dependent hyperparameter is computed as a function of the values of the hyperparameters it depends on (see line 7 of Algorithm 1). Dependent hyperparameters are useful to encode relations between hyperparameters, e.g., in a convolutional network search space, we may want the number of filters to increase after each spatial reduction. In our implementation, a dependent hyperparameter is instantiated as, for example, h = DependentHyperparameter(lambda dh: 2*dh["units"], {"units": h_units}). In Figure 5, in the transition from frame a to frame b, IH-3 is assigned value 1, triggering the value assignment of DH-1 according to its function fn:2*x.

Basic modules

@ifdisplaystyle
1 def one_layer_net():
2     a_in, a_out = dropout(D([0.25, 0.5]))
3     b_in, b_out = dense(D([100, 200, 300]))
4     c_in, c_out = relu()
5     a_out["out"].connect(b_in["in"])
6     b_out["out"].connect(c_in["in"])
7     return a_in, c_out
\lst
Figure 1: Search space over feedforward networks with dropout rate of 0.25 or 0.5, ReLU activations, and one hidden layer with 100, 200, or 300 units.

A basic module implements computation that depends on the values of its properties. Search spaces involving only basic modules and hyperparameters do not create new modules or hyperparameters, and therefore are fixed computational graphs (e.g., see frames c and d in Figure 5). Upon compilation, a basic module consumes the values of its inputs, performs computation, and publishes the results to its outputs (see Algorithm 4). Deep learning layers can be wrapped as basic modules, e.g., a fully connected layer can be wrapped as a single-input single-output basic module with one hyperparameter for the number of units. In the search space in Figure 1, dropout, dense, and relu are basic modules. In Figure 5, both frames c and d are search spaces with only basic modules and hyperparameters. In the search space of frame d, all hyperparameters have been assigned values, and therefore the single architecture can be mapped to its implementation (e.g., in Tensorflow).

Substitution modules

@ifdisplaystyle
1 def multi_layer_net():
2     h_or = D([0, 1])
3     h_repeat = D([1, 2, 4])
4     return siso_repeat(
5         lambda: siso_sequential([
6             dense(D([300])),
7             siso_or([relu, tanh], h_or)
8         ]), h_repeat)
\lst
Figure 2: Search space over feedforward networks with 1, 2, or 4 hidden layers and ReLU or tanh activations.

Substitution modules encode structural transformations of the computational graph that are delayed33 3 Substitution modules are inspired by delayed evaluation in programming languages. until their hyperparameters are assigned values. Similarly to a basic module, a substitution module has hyperparameters, inputs, and outputs. Contrary to a basic module, a substitution module does not implement computation—it is substituted by a subsearch space (which depends on the values of its hyperparameters and may contain new substitution modules). Substitution is triggered once all its hyperparameters have been assigned values. Upon substitution, the module is removed from the search space and its connections are rerouted to the corresponding inputs and outputs of the generated subsearch space (see Algorithm 1 for how substitutions are resolved). For example, in the transition from frame b to frame c of Figure 5, IH-2 was assigned the value 1 and Dropout-1 and IH-7 were created by the substitution of Optional-1. The connections of Optional-1 were rerouted to Dropout-1. If IH-2 had been assigned the value 0, Optional-1 would have been substituted by an identity basic module and no new hyperparameters would have been created. Figure 2 shows a search space using two substitution modules: siso_or chooses between relu and tanh; siso_repeat chooses how many layers to include. siso_sequential is used to avoid multiple calls to connect as in Figure 1.

Auxiliary functions

@ifdisplaystyle
1 def rnn_cell(hidden_fn, output_fn):
2     h_inputs, h_outputs = hidden_fn()
3     y_inputs, y_outputs = output_fn()
4     h_outputs["out"].connect(y_inputs["in"])
5     return h_inputs, y_outputs
\lst
Figure 3: Auxiliary function to create the search space for the recurrent cell given functions that create the subsearch spaces.

Auxiliary functions, while not components per se, help create complex search spaces. Auxiliary functions might take functions that create search spaces and put them together into a larger search space. For example, the search space in Figure 3 defines an auxiliary RNN cell that captures the high-level functional dependency: ht=qh(xt,ht-1) and yt=qy(ht). We can instantiate a specific search space as rnn_cell(lambda: siso_sequential([concat(2), one_layer_net()]), multi_layer_net).

5 Example search space

@ifdisplaystyle
1 def search_space():
2     h_n = D([1, 2, 4])
3     h_ndep = DependentHyperparameter(
4         lambda dh: 2 * dh["x"], {"x": h_n})
5
6     c_inputs, c_outputs = conv2d(D([64, 128]))
7     o_inputs, o_outputs = siso_optional(
8         lambda: dropout(D([0.25, 0.5])), D([0, 1]))
9     fn = lambda: conv2d(D([64, 128]))
10     r1_inputs, r1_outputs = siso_repeat(fn, h_n)
11     r2_inputs, r2_outputs = siso_repeat(fn, h_ndep)
12     cc_inputs, cc_outputs = concat(2)
13
14     o_inputs["in"].connect(c_outputs["out"])
15     r1_inputs["in"].connect(o_outputs["out"])
16     r2_inputs["in"].connect(o_outputs["out"])
17     cc_inputs["in0"].connect(r1_outputs["out"])
18     cc_inputs["in1"].connect(r2_outputs["out"])
19     return c_inputs, cc_outputs
\lst
Figure 4: Simple search space showcasing all language components. See also Figure 5.
Figure 5: Search space transitions for the search space in Figure 4 (frame a) leading to a single architecture (frame d). Modules and hyperparameters created since the previous frame are highlighted in green. Hyperparameters assigned values since the previous frame are highlighted in red.

We ground discussion textually, through code examples (Figure 4), and visually (Figure 5) through an example search space. There is a convolutional layer followed, optionally, by dropout with rate 0.25 or 0.5. After the optional dropout layer, there are two parallel chains of convolutional layers. The first chain has length 1, 2, or 4, and the second chain has double the length of the first. Finally, the outputs of both chains are concatenated. Each convolutional layer has 64 or 128 filters (chosen separately). This search space has 25008 distinct models.

Figure 5 shows a sequence of graph transitions for this search space. IH and DH denote type identifiers for independent and dependent hyperparameters, respectively. Modules and hyperparameters types are suffixed with a number to generate unique identifiers. Modules are represented by rectangles that contain inputs, outputs, and properties. Hyperparameters are represented by ellipses (outside of modules) and are associated to module properties (e.g., in frame a, IH-1 is associated to filters of Conv2D-1). To the right of an independent hyperparameter we show, before assignment, its set of possible values and, after assignment, its value (e.g., IH-1 in frame a and in frame d, respectively). Similarly, for a dependent hyperparameter we show, before assignment, the function that computes its value and, after assignment, its value (e.g., DH-1 in frame a and in frame b, respectively). Frame a shows the initial search space encoded in Figure 4. From frame a to frame b, IH-3 is assigned a value, triggering the value assignment for DH-1 and the substitutions for Repeat-1 and Repeat-2. From frame b to frame c, IH-2 is assigned value 1, creating Dropout-1 and IH-7 (its dropout rate hyperparameter). Finally, from frame c to frame d, the five remaining independent hyperparameters are assigned values. The search space in frame d has a single architecture that can be mapped to an implementation in a deep learning framework.

6 Semantics and mechanics of the search space specification language

In this section, we formally describe the semantics and mechanics of our language and show how they can be used to implement search algorithms for arbitrary search spaces.

6.1 Semantics

Search space components

A search space G has hyperparameters H(G) and modules M(G). We distinguish between independent and dependent hyperparameters as Hi(G) and Hd(G), where H(G)=Hi(G)Hd(G) and Hd(G)Hi(G)=, and basic modules and substitution modules as Mb(G) and Ms(G), where M(G)=Mb(G)Ms(G) and Mb(G)Ms(G)=.

Hyperparameters

We distinguish between hyperparameters that have been assigned a value and those that have not as Ha(G) and Hu(G). We have H(G)=Hu(G)Ha(G) and Hu(G)Ha(G)=. We denote the value assigned to an hyperparameter hHa(G) as v(G),(h)𝒳(h), where hHa(G) and 𝒳(h) is the set of possible values for h. Independent and dependent hyperparameters are assigned values differently. For hHi(G), its value is assigned directly from 𝒳(h). For hHd(G), its value is computed by evaluating a function f(h) for the values of H(h), where H(h) is the set of hyperparameters that h depends on. For example, in frame a of Figure 5, for h=DH-1, H(h)={IH-3}. In frame b, Ha(G)={IH-3,DH-1} and Hu(G)={IH-1,IH-4,IH-5,IH-6,IH-2}.

Modules

A module mM(G) has inputs I(m), outputs O(m), and hyperparameters H(m)H(G) along with mappings assigning names local to the module to inputs, outputs, and hyperparameters, respectively, σ(m),i:S(m),iI(m), σ(m),o:S(m),oO(m), σ(m),h:S(m),hH(m), where S(m),iΣ*, S(m),oΣ*, and S(m),hΣ*, where Σ* is the set of all strings of alphabet Σ. S(m),i, S(m),o, and S(m),h are, respectively, the local names for the inputs, outputs, and hyperparameters of m. Both σ(m),i and σ(m),o are bijective, and therefore, the inverses σ(m),i-1:I(m)Sm,i and σ(m),o-1:O(m)S(m),o exist and assign an input and output to its local name. Each input and output belongs to a single module. σ(m),h might not be injective, i.e., |S(m),h||H(m)|. A name sS(m),h captures the local semantics of σ(m),h(s) in mM(G) (e.g., for a convolutional basic module, the number of filters or the kernel size). Given an input iI(M(G)), m(i) recovers the module that i belongs to (analogously for outputs). For mm, we have I(m)I(m)= and O(m)O(m)=, but there might exist m,mM(G) for which H(m)H(m), i.e., two different modules might share hyperparameters but inputs and outputs belong to a single module. We use shorthands I(G) for I(M(G)) and O(G) for O(M(G)). For example, in frame a of Figure 5, for m=Conv2D-1 we have: I(m)={Conv2D-1.in}, O(m)={Conv2D-1.out}, and H(m)={IH-1}; S(m),i={𝚒𝚗} and σ(m),i(𝚒𝚗)=Conv2D-1.in (σ(m),o and σ(m),h are similar); m(Conv2D-1.in)=Conv2D-1. Output and inputs are identified by the global name of their module and their local name within their module joined by a dot, e.g.. Conv2D-1.in

Connections between modules

Connections between modules in G are represented through the set of directed edges E(G)O(G)×I(G) between outputs and inputs of modules in M(G). We denote the subset of edges involving inputs of a module mM(G) as Ei(m), i.e., Ei(m)={(o,i)E(G)iI(m)}. Similarly, for outputs, Eo(m)={(o,i)E(G)oO(m)}. We denote the set of edges involving inputs or outputs of m as E(m)=Ei(m)Eo(m). In frame a of Figure 5, For example, in frame a of Figure 5, Ei(Optional-1)={(Conv2D-1.out,Optional-1.in)} and Eo(Optional-1)={(Optional-1.out,Repeat-1.in),(Optional-1.out,Repeat-2.in)}.

Search spaces

We denote the set of all possible search spaces as 𝒢. For a search space G𝒢, we define (G)={G𝒢G1,,Gm𝒢m,Gk+1=𝚃𝚛𝚊𝚗𝚜𝚒𝚝𝚒𝚘𝚗(Gk,h,v),hHi(Gk)Hu(Gk),v𝒳(h),k[m],G1=G,Gm=G}, i.e., the set of reachable search spaces through a sequence of value assignments to independent hyperparameters (see Algorithm 1 for the description of Transition). We denote the set of terminal search spaces as 𝒯𝒢, i.e. 𝒯={G𝒢Hi(G)Hu(G)=}. We denote the set of terminal search spaces that are reachable from G𝒢 as 𝒯(G)=(G)𝒯. In Figure 5, if we let G and G be the search spaces in frame a and d, respectively, we have G𝒯(G).

6.2 Mechanics

\SetKwRepeatDodowhile\KwInG,hHi(G)Hu(G),v𝒳(h) v(G),(h)v
\DoH~d(G) or M~s(G) H~d(G)={hHd(G)Hu(G)Hu(h)=}
\ForhH~d(G) n|S(h)|
Let S(h)={s1,,sn} with s1<<sn
v(G),(h)f(h)(vG,σ(h)(s1),,vG,σ(h)(sn)) M~s(G)={mMs(G)Hu(m)=}
\FormM~s(G) n|S(m),h|
Let S(m),h={s1,,sn} with s1<<sn
(Gm,σi,σo)=f(m)(vG,σ(m),h(s1),,vG,σ(m),h(sn))
Ei={(o,i)(o,i)Ei(m),i=σi(σ(m),i-1(i))}
Eo={(o,i)(o,i)Eo(m),o=σo(σ(m),o-1(o))}
E(G)(E(G)E(m))(EiEo)
M(G)(M(G){m})M(Gm)
H(G)H(G)H(Gm) \ReturnG
\algorithmcfname 1 Transition
\KwInG,σo:SoOu(G) Mq𝙾𝚛𝚍𝚎𝚛𝚎𝚍𝙼𝚘𝚍𝚞𝚕𝚎𝚜(G,σo)
Hq[]
\FormMq n=|S(m),h|
Let S(m),h={s1,,sn} with s1<<sn.
\Forj[n] hσ(m),h(sj)
\IfhHq HqHq+[h]
\ForhHq \IfhHd(G) n|S(h)|
Let S(h)={s1,,sn} with s1<<sn
\Forj[n] hσ(h)(sj)
\IfhHq HqHq+[h]
\ReturnHq
\algorithmcfname 2 OrderedHyperps
Figure 6: Left: Transition assigns a value to an independent hyperparameter and resolves assignments to dependent hyperparameters (line 3 to 7) and substitutions (line 8 to 17) until none are left (line 18). Right: OrderedHyperps returns H(G) sorted according to a unique order. Adds the hyperparameters that are immediately reachable from modules (line 1 to 9), and then traverses the dependencies of the dependent hyperparameters to find additional hyperparameters (line 10 to 17).

Search space transitions

A search space G𝒢 encodes a set of architectures (i.e., those in 𝒯(G)). Different architectures are obtained through different sequences of value assignments leading to search spaces in 𝒯(G). Graph transitions result from value assignments to independent hyperparameters. Algorithm 1 shows how the search space G=𝚃𝚛𝚊𝚗𝚜𝚒𝚝𝚒𝚘𝚗(G,h,v) is computed, where hHi(G)Hu(G) and v𝒳(h). Each transition leads to progressively smaller search spaces (i.e., for all G𝒢,G=𝚃𝚛𝚊𝚗𝚜𝚒𝚝𝚒𝚘𝚗(G,h,v) for hHi(G)Hu(G) and v𝒳(h), then (G)(G)). A search space G𝒯(G) is reached once there are no independent hyperparameters left to assign values to, i.e., Hi(G)Hu(G)=. For G𝒯(G), Ms(G)=, i.e., there are only basic modules left. For search spaces G𝒢 for which Ms(G)=, we have M(G)=M(G) (i.e., Mb(G)=Mb(G)) and H(G)=H(G) for all G(G), i.e., no new modules and hyperparameters are created as a result of graph transitions. Algorithm 1 can be implemented efficiently by checking whether assigning a value to hHi(G)Hu(G) triggered substitutions of neighboring modules or value assignments to neighboring hyperparameters. For example, for the search space G of frame d of Figure 5, Ms(G)=. Search spaces G, G, and G′′ for frames a, b, and c, respectively, are related as G=𝚃𝚛𝚊𝚗𝚜𝚒𝚝𝚒𝚘𝚗(G,IH-3,1) and G′′=𝚃𝚛𝚊𝚗𝚜𝚒𝚝𝚒𝚘𝚗(G,IH-2,1). For the substitution resolved from frame b to frame c, for m=Optional-1, we have σi(𝚒𝚗)=Dropout-1.in and σo(𝚘𝚞𝚝)=Dropout-1.out (see line 12 in Algorithm 1).

Traversals over modules and hyperparameters

Search space traversal is fundamental to provide the interface to search spaces that search algorithms rely on (e.g., see Algorithm 3) and to automatically map terminal search spaces to their runnable computational graphs (see Algorithm 4 in Appendix C). For G𝒢, this iterator is implemented by using Algorithm 2 and keeping only the hyperparameters in Hu(G)Hi(G). The role of the search algorithm (e.g., see Algorithm 3) is to recursively assign values to hyperparameters in Hu(G)Hi(G) until a search space G𝒯(G) is reached. Uniquely ordered traversal of H(G) relies on uniquely ordered traversal of M(G). (We defer discussion of the module traversal to Appendix C, see Algorithm 5.)

Architecture instantiation

A search space G𝒯 can be mapped to a domain implementation (e.g. computational graph in Tensorflow [1] or PyTorch [20]). Only fully-specified basic modules are left in a terminal search space G (i.e., Hu(G)= and Ms(G)=). The mapping from a terminal search space to its implementation relies on graph traversal of the modules according to the topological ordering of their dependencies (i.e., if m connects to an output of m, then m should be visited after m).

\KwInG,σo:SoOu(G),k rbest-
\Forj[k] GG
\WhileG𝒯 Hq𝙾𝚛𝚍𝚎𝚛𝚎𝚍𝙷𝚢𝚙𝚎𝚛𝚙𝚜(G,σo)
\ForhHq \IfhHu(G)Hi(G) v𝚄𝚗𝚒𝚏𝚘𝚛𝚖(𝒳(h))
G𝚃𝚛𝚊𝚗𝚜𝚒𝚝𝚒𝚘𝚗(G,h,v) r𝙴𝚟𝚊𝚕𝚞𝚊𝚝𝚎(G)
\Ifr>rbest rbestr
GbestG \ReturnGbest
\algorithmcfname 3 Random search.
Figure 7: Assigns a value uniformly at random (line 8) for each independent hyperparameter (line 7) in the search space until a terminal search space is reached (line 4).

Appendix C details this graph propagation process (see Algorithm 4). For example, it is simple to see how the search space of frame d of Figure 5 can be mapped to an implementation.

6.3 Supporting search algorithms

Search algorithms interface with search spaces through ordered iteration over unassigned independent hyperparameters (implemented with the help of Algorithm 2) and value assignments to these hyperparameters (which are resolved with Algorithm 1). Algorithms are run for a fixed number of evaluations k, and return the best architecture found. The iteration functionality in Algorithm 2 is independent of the search space and therefore can be used to expose search spaces to search algorithms. We use this decoupling to mix and match search spaces and search algorithms without implementing each pair from scratch (see Section 7).

7 Experiments

We showcase the modularity and programmability of our language by running experiments that rely on decoupled of search spaces and search algorithms. The interface to search spaces provided by the language makes it possible to reuse implementations of search spaces and search algorithms.

7.1 Search space experiments

Table 1: Test results for search space experiments.
Search Space Test Accuracy
Genetic [26] 90.07
Flat [16] 93.58
Nasbench [27] 94.59
Nasnet [29] 93.77

We vary the search space and fix the search algorithm and the evaluation method. We refer to the search spaces we consider as Nasbench [27], Nasnet [29], Flat [16], and Genetic [26]. For the search phase, we randomly sample 128 architectures from each search space and train them for 25 epochs with Adam with a learning rate of 0.001. The test results for the fully trained architecture with the best validation accuracy are reported in Table 1. These experiments provide a simple characterization of the search spaces in terms of the number of parameters, training times, and validation performances at 25 epochs of the architectures in each search space (see Figure 8). Our language makes these characterizations easy due to better modularity (the implementations of the search space and search algorithm are decoupled) and programmability (new search spaces can be encoded and new search algorithms can be developed).

Figure 8: Results for the architectures sampled in the search space experiments. Left: Relation between number of parameters and validation accuracy at 25 epochs. Right: Relation between time to complete 25 epochs of training and validation accuracy.

7.2 Search algorithm experiments

Table 2: Test results for search algorithm experiments.
Search algorithm Test Accuracy
Random 91.61±0.67
MCTS [6] 91.45±0.11
SMBO [17] 91.93±1.03
Evolution [23] 91.32±0.50

We evaluate search algorithms by running them on the same search space. We use the Genetic search space [26] for these experiments as Figure 8 shows its architectures train quickly and have substantially different validation accuracies. We examined the performance of four search algorithms: random, regularized evolution, sequential model based optimization (SMBO), and Monte Carlo Tree Search (MCTS). Random search uniformly samples values for independent hyperparameters (see Algorithm 3). Regularized evolution [23] is an evolutionary algorithm that mutates the best performing member of the population and discards the oldest. We use population size 100 and sample size 25. For SMBO [17], we use a linear surrogate function to predict the validation accuracy of an architecture from its features (hashed modules sequences and hyperparameter values). For each architecture requested from this search algorithm, with probability 0.1 a randomly specified architecture is returned; otherwise it evaluates 512 random architectures with the surrogate model and returns the one with the best predicted validation accuracy. MCTS [6, 17] uses the Upper Confidence Bound for Trees (UCT) algorithm with the exploration term of 0.33. Each run of the search algorithm samples 256 architectures that are trained for 25 epochs with Adam with a learning rate of 0.001. We ran three trials for each search algorithm. See Figure 9 and Table 2 for the results. By comparing Table 1 and Table 2, we see that the choice of search space had a much larger impact on the test accuracies observed than the choice of search algorithm. See Appendix F for more details.

Figure 9: Results for search algorithm experiments. Left: Relation between the performance of the best architecture found and the number of architectures sampled. Right: Histogram of validation accuracies for the architectures encountered by each search algorithm.

8 Conclusions

We design a language to encode search spaces over architectures to improve the programmability and modularity of architecture search research and practice. Our language allows us to decouple the implementations of search spaces and search algorithms. This decoupling enables to mix-and-match search spaces and search algorithms without having to write each pair from scratch. We reimplement search spaces and search algorithms from the literature and compare them under the same conditions. We hope that decomposing architecture search experiments through the lens of our language will lead to more reusable and comparable architecture search research.

9 Acknowledgements

We thank the anonymous reviewers for helpful comments and suggestions. We thank Graham Neubig, Barnabas Poczos, Ruslan Salakhutdinov, Eric Xing, Xue Liu, Carolyn Rose, Zhiting Hu, Willie Neiswanger, Christoph Dann, Kirielle Singajarah, and Zejie Ai for helpful discussions. We thank Google for generous TPU and GCP grants. This work was funded in part by NSF grant IIS 1822831.

References

  • [1] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. Corrado, A. Davis, J. Dean, M. Devin, et al. (2016) Tensorflow: large-scale machine learning on heterogeneous distributed systems. arXiv:1603.04467. Cited by: §6.2.
  • [2] J. Bergstra, R. Bardenet, Y. Bengio, and B. Kégl (2011) Algorithms for hyper-parameter optimization. In NeurIPS, Cited by: §2.
  • [3] J. Bergstra and Y. Bengio (2012) Random search for hyper-parameter optimization. JMLR. Cited by: §2.
  • [4] J. Bergstra, D. Yamins, and D. Cox (2013) Hyperopt: a Python library for optimizing the hyperparameters of machine learning algorithms. Cited by: Towards modular and programmable architecture search, §1, §2, 3rd item.
  • [5] J. Bergstra, D. Yamins, and D. Cox (2013) Making a science of model search: hyperparameter optimization in hundreds of dimensions for vision architectures. JMLR. Cited by: §2.
  • [6] C. Browne, E. Powley, D. Whitehouse, S. Lucas, P. Cowling, P. Rohlfshagen, S. Tavener, D. Perez, S. Samothrakis, and S. Colton (2012) A survey of Monte Carlo tree search methods. IEEE Transactions on Computational Intelligence and AI in games. Cited by: Table 4, §7.2, Table 2.
  • [7] T. Elsken, J. H. Metzen, and F. Hutter (2019) Neural architecture search: a survey. JMLR. Cited by: Towards modular and programmable architecture search, §1.
  • [8] S. Falkner, A. Klein, and F. Hutter (2018) BOHB: robust and efficient hyperparameter optimization at scale. In ICML, Cited by: Towards modular and programmable architecture search, §1, §2.
  • [9] M. Feurer, A. Klein, K. Eggensperger, J. Springenberg, M. Blum, and F. Hutter (2015) Efficient and robust automated machine learning. In NeurIPS, Cited by: §2.
  • [10] F. Hutter, H. Hoos, and K. Leyton-Brown (2011) Sequential model-based optimization for general algorithm configuration. In International Conference on Learning and Intelligent Optimization, Cited by: §2.
  • [11] H. Jin, Q. Song, and X. Hu (2018) Efficient neural architecture search with network morphism. arXiv:1806.10282. Cited by: §2.
  • [12] K. Kandasamy, W. Neiswanger, J. Schneider, B. Poczos, and E. Xing (2018) Neural architecture search with Bayesian optimisation and optimal transport. NeurIPS. Cited by: §2.
  • [13] L. Li, K. Jamieson, G. DeSalvo, A. Rostamizadeh, and A. Talwalkar (2017) Hyperband: a novel bandit-based approach to hyperparameter optimization. JMLR. Cited by: §2.
  • [14] M. Lindauer, K. Eggensperger, M. Feurer, S. Falkner, A. Biedenkapp, and F. Hutter (2017) SMACv3: algorithm configuration in Python. GitHub. Note: \urlhttps://github.com/automl/SMAC3 Cited by: §1, §2.
  • [15] C. Liu, B. Zoph, M. Neumann, J. Shlens, W. Hua, L. Li, L. Fei-Fei, A. Yuille, J. Huang, and K. Murphy (2018) Progressive neural architecture search. In ECCV, Cited by: §2.
  • [16] H. Liu, K. Simonyan, O. Vinyals, C. Fernando, and K. Kavukcuoglu (2018) Hierarchical representations for efficient architecture search. ICLR. Cited by: Table 3, §2, §7.1, Table 1.
  • [17] R. Negrinho and G. Gordon (2017) DeepArchitect: automatically designing and training deep architectures. arXiv:1704.08792. Cited by: Table 4, §2, §7.2, Table 2.
  • [18] R. Olson and J. Moore (2016) TPOT: a tree-based pipeline optimization tool for automating machine learning. In Workshop on Automatic Machine Learning, Cited by: §1.
  • [19] R. Olson, R. Urbanowicz, P. Andrews, N. Lavender, J. Moore, et al. (2016) Automating biomedical data science through tree-based pipeline optimization. In European Conference on the Applications of Evolutionary Computation, Cited by: §2.
  • [20] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer (2017) Automatic differentiation in PyTorch. Cited by: §6.2.
  • [21] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, et al. (2011) Scikit-Learn: machine learning in Python. JMLR. Cited by: 2nd item.
  • [22] H. Pham, M. Guan, B. Zoph, Q. Le, and J. Dean (2018) Efficient neural architecture search via parameter sharing. ICML. Cited by: Figure 16, Appendix B, §4.
  • [23] E. Real, A. Aggarwal, Y. Huang, and Q. Le (2019) Regularized evolution for image classifier architecture search. AAAI. Cited by: Table 4, §2, §7.2, Table 2.
  • [24] B. Shahriari, K. Swersky, Z. Wang, R. Adams, and N. De Freitas (2016) Taking the human out of the loop: a review of Bayesian optimization. Proceedings of the IEEE. Cited by: §2.
  • [25] J. Snoek, H. Larochelle, and R. Adams (2012) Practical Bayesian optimization of machine learning algorithms. In NeurIPS, Cited by: Towards modular and programmable architecture search, §1, §2.
  • [26] L. Xie and A. Yuille (2017) Genetic CNN. In ICCV, Cited by: Table 3, §7.1, §7.2, Table 1.
  • [27] C. Ying, A. Klein, E. Christiansen, E. Real, K. Murphy, and F. Hutter (2019) NAS-Bench-101: towards reproducible neural architecture search. In ICML, Cited by: Table 3, §7.1, Table 1.
  • [28] B. Zoph and Q. Le (2017) Neural architecture search with reinforcement learning. ICLR. Cited by: §2.
  • [29] B. Zoph, V. Vasudevan, J. Shlens, and Q. Le (2018) Learning transferable architectures for scalable image recognition. In CVPR, Cited by: Table 3, §7.1, Table 1.

Appendix A Additional details about language components

Independent hyperparameters

@ifdisplaystyle
1 h_filters = D([32, 64, 128])
2 h_stride = D([1])
3 conv_fn = lambda h_kernel_size: conv2d(
4     h_filters, h_stride, h_kernel_size)
5 (c1_inputs, c1_outputs) = conv_fn(D([1, 3, 5]))
6 (c2_inputs, c2_outputs) = conv_fn(D([1, 3, 5]))
7 c1_outputs["out"].connect(c2_inputs["in"])
\lst
Figure 10: Search space with two convolutions in series. The number of filters is the same for both, while the kernel sizes are chosen separately.

An hyperparameter can be shared by instantiating it and using it in multiple modules. For example, in Figure 10, conv_fn has access to h_filters and h_stride through a closure and uses them in boths calls. There are 27 architectures in this search space (corresponding to the possible choices for the number of filters, stride, and kernel size). The output of the first convolution is connected to the input of the second through the call to connect (line 7).

Dependent hyperparameters

@ifdisplaystyle
1 h_filters_lst = [D([32, 64, 128])]
2 h_factor = D([1, 2, 4])
3 h_stride = D([1])
4 io_lst = []
5 for i in range(3):
6     h = h_filters_lst[i]
7     (inputs, outputs) = conv2d(h, h_stride,
8                                 D([1, 3, 5]))
9     io_lst.append((inputs, outputs))
10     if i > 0:
11         io_lst[i - 1][1]["out"].connect(
12             io_lst[i][0]["in"])
13     if i < 2:
14         h_next = DependentHyperparameter(
15             lambda x, y: x * y,
16             {"x": h, "y": h_factor})
17         h_filters_lst.append(h_next)
\lst
Figure 11: Search space with three convolutions in series. The number of filters of an inner convolution is a multiple of the number of filters of the previous convolution. The multiple is chosen through an hyperparameter (h_factor).

Chains (or general directed acyclic graphs) involving dependent and independent hyperparameters are valid. The search space in Figure 11 has three convolutional modules in series. Each convolutional module shares the hyperparameter for the stride, does not share the hyperparameter for the kernel size, and relates the hyperparameters for the number of filters via a chain of dependent hyperparameters. Each dependent hyperparameter depends on the previous hyperparameter and on the multiplier hyperparameter. This search space has 243 distinct architectures

Encoding this search space in our language might not seem advantageous when compared to encoding it in an hyperparameter optimization tool. Similarly to ours, the latter requires defining hyperparameters for the multiplier, the initial number of filters, and the three kernel sizes (chosen separately). Unfortunately, the encoding by itself tells us nothing about the mapping from hyperparameter values to implementations—the expert must write separate code for this mapping and change it when the search space changes. By contrast, in our language the expert only needs to write the encoding for the search space—the mapping to implementations is induced automatically from the encoding.

@ifdisplaystyle 1 def dense(h_units): 2     def compile_fn(di, dh): 3         m =  tf.layers.Dense(dh[’units’]) 4         def forward_fn(di): 5             return {"out": m(di["in"])} 6         return forward_fn 7     name_to_hyperp = {’units’: h_units} 8     return siso_tensorflow_module( 9         Affine’, compile_fn, name_to_hyperp, scope) \lst @ifdisplaystyle 1 def conv2d(h_num_filters, h_filter_width, h_stride): 2     def compile_fn(di, dh): 3         conv_op = tf.layers.Conv2D( 4             dh[’num_filters’], 5             (dh[’filter_width’],) * 2, 6             (dh[’stride’],) * 2, 7             padding=’SAME’) 8         def forward_fn(di): 9             return {’out’: conv_op(di[’in’])} 10         return forward_fn 11     return siso_tensorflow_module( 12         Conv2D’, compile_fn, { 13             num_filters’: h_num_filters, 14             filter_width’: h_filter_width, 15             stride’: h_stride 16         }) \lst
Figure 12: Examples of basic modules in our implementation resulting from wrapping Tensorflow operations. Left: Affine basic module with an hyperparameter for the number of units. Right: Convolutional basic module with hyperparameters for the number of filters, filter size, and stride.

Basic Modules

Deep learning layers can be easily wrapped as basic modules. For example, a dense layer can be wrapped as a single-input single-output module with one hyperparameter for the number of units (see left of Figure 12). A convolutional layer is another example of a single-input single-output module (see right of Figure 12). The implementation of conv2d relies on siso_tensorflow_module for wrapping Tensorflow-specific aspects (see Appendix E.1 for a discussion on how to support different domains). conv2d depends on hyperparameters for num_filters, filter_width, and stride. The key observation is that a basic module generates its implementation (calls to compile_fn and then forward_fn) only after its hyperparameter values have been assigned and it has values for its inputs. The values of the inputs and the hyperparameters are available in the dictionaries di and dh, respectively. conv2d returns a module as (inputs, outputs) (these are analogous to σi and σh on line of 12 of Algorithm 1). Instantiating the computational graph relies on compile_fn and forward_fn. compile_fn is called a single time, e.g., to instantiate the parameters of the basic module. forward_fn can be called multiple times to create the computational graph (in static frameworks such as Tensorflow) or to evaluate the computational graph for specific data (e.g., in dynamic frameworks such as PyTorch). Parameters instantiated in compile_fn are available to forward_fn through a closure.

@ifdisplaystyle 1 def mimo_or(fn_lst, h_or, input_names, 2         output_names, scope=None, name=None): 3     def substitution_fn(dh): 4         return fn_lst[dh["idx"]]() 5 6     return substitution_module( 7         _get_name(name, "Or"), 8         substitution_fn, 9         {’idx’: h_or}, 10         input_names, output_names, scope) \lst @ifdisplaystyle 1 def siso_repeat(fn, h_num_repeats, 2         scope=None, name=None): 3     def substitution_fn(dh): 4         assert dh["num_reps"] > 0 5         return siso_sequential([fn() 6             for _ in range(dh["num_reps"])]) 7 8     return substitution_module( 9         _get_name(name, "SISORepeat"), 10         substitution_fn, 11         {’num_reps’: h_num_repeats}, 12         [’in’], [’out’], scope) \lst @ifdisplaystyle 1 def siso_split_combine(fn, combine_fn, 2         h_num_splits, scope=None, name=None): 3     def substitution_fn(dh): 4         inputs_lst, outputs_lst = zip(*[fn() 5             for _ in range(dh["num_splits"])]) 6         c_inputs, c_outputs = combine_fn( 7             dh["num_splits"]) 8 9         i_inputs, i_outputs = identity() 10         for i in range(dh["num_splits"]): 11             i_outputs[’out’].connect( 12                 inputs_lst[i][’in’]) 13             c_inputs[’in + str(i)].connect( 14                 outputs_lst[i][’out’]) 15         return i_inputs, c_outputs 16 17     return substitution_module( 18         _get_name(name, "SISOSplitCombine"), 19         substitution_fn, 20         {’num_splits’: h_num_splits}, 21         [’in’], [’out’], scope) \lst
Figure 13: Example substitution modules implemented in our framework. Top left: mimo_or chooses between a list of functions returning search spaces. Bottom left: Creates a series connection of the search space returned by fn some number of times (determined by h_num_repeats). Right: Creates a search space with a number (determined by h_num_splits) of single-input single-output parallel search spaces created by fn that are then combined into the search space created by combine_fn.

Substitution modules

Substitution modules encode local structural transformations of the search space that are resolved once all their hyperparameters have been assigned values (see line 12 in Algorithm 1). Consider the implementation of mimo_or (i.e., mimo stands for multi-input, multi-output) in Figure 13 (top left). We make substantial use of higher-order functions and closures in our language implementation. For example, to implement a specific or substitution module, we only need to provide a list of functions that return search spaces. Arguments that the functions would need to carry are accessed through the closure or through argument binding44 4 This is often called a thunk in programming languages.. mimo_or has an hyperparameter for which subsearch space function to pick (h_idx). Once h_idx is assigned a value, substitution_fn is called, returning a search space as (inputs, outputs) where inputs and outputs are σi and σo mentioned on line 12 of Algorithm 1. Using mappings of inputs and outputs is convenient because it allow us to treat modules and search spaces the same (e.g., when connecting search spaces). The other substitution modules in Figure 13 use substitution_fn similarly.

@ifdisplaystyle
1 def substitution_module(name, name_to_hyperp,
2         substitution_fn, input_names, output_names):
\lst
Figure 14: Signature of the helper used to create substitution modules.

Figure 14 shows the signature of the wrapper function to easily create substitution modules. All information about what subsearch space should be generated upon substitution is delegated to substitution_fn. Compare this to signature of keras_module for Keras basic modules in Figure 21.

Auxiliary functions

Figure 15 shows how we often design search spaces. We have a high-level inductive bias (e.g., what operations are likely to be useful) for a good architecture for a task, but we might be unsure about low-level details (e.g., the exact sequence of operations of the architecture). Auxiliary functions allows us to encapsulate aspects of search space creation and can be reused for creating different search spaces, e.g., through different calls to these functions.

it =σ(Wiixt+bii+Whiht-1+bhi) ft =σ(Wifxt+bif+Whfht-1+bhf) gt =tanh(Wigxt+big+Whght-1+bhg) ot =σ(Wioxt+bio+Whoht-1+bho) ct =ftct-1+itgt ht =ottanh(ct) it =qi(xt,ht-1) ft =qf(xt,ht-1) gt =qg(xt,ht-1) ot =qo(xt,ht-1) ct =qc(ft,ct-1,it,gt) ht =qh(ot,ct) @ifdisplaystyle 1 def lstm_cell(input_fn, forget_fn, gate_fn, 2               output_fn, cell_fn, hidden_fn): 3 4     x_inputs, x_outputs = identity() 5     hprev_inputs, hprev_outputs = identity() 6     cprev_inputs, cprev_outputs = identity() 7 8     i_inputs, i_outputs = input_fn() 9     f_inputs, f_outputs = forget_fn() 10     g_inputs, g_outputs = gate_fn() 11     o_inputs, o_outputs = output_fn() 12     c_inputs, c_outputs = cell_fn() 13     h_inputs, h_outputs = hidden_fn() 14 15     i_inputs["in0"].connect(x_outputs["out"]) 16     i_inputs["in1"].connect(hprev_outputs["out"]) 17     f_inputs["in0"].connect(x_outputs["out"]) 18     f_inputs["in1"].connect(hprev_outputs["out"]) 19     g_inputs["in0"].connect(x_outputs["out"]) 20     g_inputs["in1"].connect(hprev_outputs["out"]) 21     c_inputs["in0"].connect(f_outputs["out"]) 22     c_inputs["in1"].connect(cprev_outputs["out"]) 23     c_inputs["in2"].connect(i_outputs["out"]) 24     c_inputs["in3"].connect(g_outputs["out"]) 25     o_inputs["in0"].connect(x_inputs["in"]) 26     o_inputs["in1"].connect(hprev_inputs["in"]) 27     h_inputs["in0"].connect(o_outputs["out"]) 28     h_inputs["in1"].connect(c_outputs["out"]) 29 30     return ({"x": x_inputs["in"], 31              "hprev": hprev_inputs["in"], 32              "cprev": cprev_inputs["in"]}, 33             {"c": c_outputs["out"], 34              "h": h_outputs["out"]}) \lst
Figure 15: Left: LSTM equations showing how the expert might abstract the LSTM structure into a general functional dependency. Right: Auxiliary function for a LSTM cell that takes functions that return the search spaces for input, output, and forget gates, and the cell update, hidden state output, and context mechanisms and arranges them together to create the larger LSTM-like search space.

Appendix B Search space example

Figure 16 shows the recurrent cell search space introduced in [22] encoded in our language implementation. This search space is composed of a sequence of nodes. For each node, we choose its type and from which node output will it get its input. The cell output is the average of the outputs of all nodes after the first one. The encoding of this search space exemplifies the expressiveness of substitution modules. The cell connection structure is created through a substitution module that has hyperparameters representing where each node will get its input from. The substitution function that creates this cell takes functions that return inputs and outputs of the subsearch spaces for the input and intermediate nodes. Each subsearch space determines the operation performed by the node. While more complex than the other examples that we have presented, the same language constructs allow us to approach the encoding of this search space. Functions cell, input_node, intermediate_node, and search_space define search spaces that are fully encapsulated and that therefore, can be reused for creating new search spaces.

@ifdisplaystyle 1 def cell(num_nodes, 2          h_units, 3          input_node_fn, 4          intermediate_node_fn, 5          combine_fn): 6 7     def substitution_fn(dh): 8         input_node = input_node_fn(h_units) 9         inter_nodes = [ 10             intermediate_node_fn(h_units) 11             for _ in range(1, num_nodes) 12         ] 13         nodes = [input_node] + inter_nodes 14 15         for i in range(1, num_nodes): 16             nodes[i][0]["in"].connect( 17                 nodes[dh[str(i)]][1]["out"]) 18 19         used_ids = set(dh.values()) 20         unused_ids = set(range(num_nodes) 21             ).difference(used_ids) 22         c_inputs, c_outputs = combine_fn( 23                 len(unused_ids)) 24         for j, i in enumerate(sorted(unused_ids)): 25             c_inputs ["in%d"%j].connect( 26                 nodes[i][1]["out"]) 27 28         return (input_node[0], 29                 {"ht+1": c_outputs["out"]}) 30 31     name_to_hyperp = {str(i): D(range(i)) 32         for i in range(1, num_nodes)} 33 34     return substitution_module("Cell", 35         substitution_fn, name_to_hyperp, 36         ["x", "ht"], ["ht+1"]) \lst @ifdisplaystyle 1 def input_node_fn(h_units): 2     h_inputs, h_outputs = affine(h_units) 3     x_inputs, x_outputs = affine(h_units) 4     a_inputs, a_outputs = add(2) 5     n_inputs, n_outputs = nonlinearity(D(["relu", 6         "tanh","sigmoid", "identity"])) 7 8     a_inputs["in0"].connect(x_outputs["out"]) 9     a_inputs["in1"].connect(h_outputs["out"]) 10     n_inputs["in"].connect(a_outputs["out"]) 11 12     return { 13         "x": x_inputs["in"], 14         "ht": h_inputs["in"]}, n_outputs 15 16 17 def intermediate_node_fn(h_units): 18     a_inputs, a_outputs = affine(h_units) 19     n_inputs, n_outputs = nonlinearity(D(["relu", 20         "tanh", "sigmoid", "identity"])) 21     a_outputs["out"].connect(n_inputs["in"]) 22     return a_inputs, n_outputs \lst @ifdisplaystyle 1 def search_space(): 2     h_units = D([32, 64, 128, 256]) 3     return cell(8, h_units, 4         input_node_fn, intermediate_node_fn, avg) \lst
Figure 16: Recurrent search space from ENAS [22] encoded using our language implementation. A substitution module is used to delay the creation of the cell topology. The code uses higher order functions to create the cell search space from the subsearch spaces of its nodes (i.e., input_node_fn and intermediate_node_fn).

Appendix C Additional details about language mechanics

Ordered module traversal

Algorithm 5 generates a unique ordering over modules M(G) by starting at the modules that have outputs in Ou(G) (which are named by σo) and traversing backwards, moving from a module to its neighboring modules (i.e., the modules that connect an output to an input of this module). A unique ordering is generated by relying on the lexicographic ordering of the local names (see lines 3 and 10 in Algorithm 5).

Architecture instantiation

Mapping an architecture G𝒯 relies on traversing M(G) in topological order. Intuitively, to do the local computation of a module mM(G) for G𝒯, the modules that m depends on (i.e., which feed an output into an input of m) must have done their local computations to produce their outputs (which will now be available as inputs to m). Graph propagation (Algorithm 4) starts with values for the unconnected inputs Iu(G) and applies local module computation according to the topological ordering of the modules until the values for the unconnected outputs Ou(G) are generated. g(m) maps input and hyperparameter values to the local computation of m. The arguments of g(m) and its results are sorted according to their local names (see lines 2 to 8).

\KwInG𝒯,x(i) for iIu(G) and x(i)𝒳(i) \FormOrderedTopologically(M(G)) S(m),h={sh,1,,sh,nh} for sh,1<<sh,nh
S(m),i={si,1,,si,ni} for si,1<<si,ni
S(m),o={so,1,,so,no} for so,1<<so,no
xjx(σ(m),i(si,j)), for j[ni]
vjv(σ(m),h(sh,j)), for j[nh]
(y1,,yno)g(m)(x1,,xni,v1,,vnh)
yσ(m),o(so,j)yj for j[no]
\For(o,i)Eo(m) x(i)y(o) \Returny(o) for oOu(G)
\algorithmcfname 4 Forward
\KwInG,σo:SoOu(G) Mq[]
n|So|
Let So={s1,,sn} with s1<<sn.
\Fork[n] mm(σo(sk))
\IfmMq MqMq+[m] \FormMq n|S(m),i|
Let S(m),i={s1,,sn} with s1<<sn.
\Whilej[n] iσ(m),i(sj)
\IfiIu(G) Take (o,i)E(G)
mm(o)
\IfmMq MqMq+[m]
\ReturnMq
\algorithmcfname 5 OrderedModules
Figure 17: Left: Forward maps a terminal search space to its domain implementation. The mapping relies on each basic module doing its local computation (encapsulated by g(m) on line 7). Forward starts with values for the unconnected inputs and traverses the modules in topological order to generate values for the unconnected outputs. Right: Iteration of M(G) according to a unique order. The first while (line 4) loop adds the modules of the outputs in Ou(G). The second while (line 8) loop traverses backwards the connections of the modules in Mq, adding new modules reached this way to Mq. m(o) denotes the module that o belongs to. See also Figure 6

Appendix D Discussion about language expressivity

D.1 Infinite search spaces

@ifdisplaystyle
1 def maybe_one_more(fn):
2     return siso_or([
3             fn, lambda: siso_sequential(
4                     [fn(), maybe_one_more(fn)])],
5             D([0, 1]))
\lst
Figure 18: Self-similar search space either returns a search space or a search space and an optional additional search space. fn returns the search space to use in this construction.

We can rely on the laziness of substitution modules to encode infinite search spaces. Figure 18 shows an example of such a search space. If the hyperparameter associated to the substitution module is assigned the value one, a new substitution module and hyperparameter are created. If the hyperparameter associated to the substitution module is assigned the value zero, recursion stops. The search space is infinite because the recursion can continue indefinitely. This search space can be used to create other search spaces compositionally. The same principles are valid for more complex search spaces involving recursion.

D.2 Search space transformation and combination

@ifdisplaystyle 1 def search_space_1(): 2     return siso_repeat( 3         lambda: siso_or([ 4             lambda: a_fn(D([0, 1])), 5             lambda: b_fn(D([0, 1])), 6             lambda: c_fn(D([0, 1]))], 7         D([0, 1, 2])), D([1, 2, 4])) \lst @ifdisplaystyle 1 def search_space_2(): 2     return siso_or([ 3         lambda: siso_repeat( 4             lambda: a_fn(D([0, 1])), 5                 D([1, 2, 4])), 6         lambda: siso_repeat( 7             lambda: b_fn(D([0, 1])), 8                 D([1, 2, 4])), 9         lambda: siso_repeat( 10             lambda: c_fn(D([0, 1])), 11                 D([1, 2, 4]))], 12         D([0, 1, 2])) \lst @ifdisplaystyle 1 def search_space_3(): 2     h = D([0, 1]) 3     return siso_or([ 4             lambda: siso_repeat( 5                 lambda: a_fn(h), D([1, 2, 4])), 6             lambda: siso_repeat( 7                 lambda: b_fn(h), D([1, 2, 4])), 8             lambda: siso_repeat( 9                 lambda: c_fn(h), D([1, 2, 4]))], 10         D([0, 1, 2])) \lst @ifdisplaystyle 1 def search_space_4(): 2     return siso_or([ 3             lambda: siso_repeat( 4                 search_space_1, D([1, 2, 4])), 5             lambda: siso_repeat( 6                 search_space_2, D([1, 2, 4])), 7             lambda: siso_repeat( 8                 search_space_3, D([1, 2, 4]))], 9         D([0, 1, 2])) \lst
Figure 19: Top left: Repeats the choice between a_fn, b_fn, and c_fn one, two, or four times. This search space shows that expressive search spaces can be created through simple arrangements of substitution modules. Bottom left: Simple transformation of search_space_1. Top right: Similar to search_space_2, but with the binary hyperparameter shared across all repetitions. Bottom right: Simple search space that is created by composing the previously defined search spaces to create a new substitution module.

We assume the existence of functions a_fn, b_fn, and c_fn that each take one binary hyperparameter and return a search space. In Figure 19, search_space_1 repeats a choice between a_fn, b_fn, and c_fn one, two, or four times. The hyperparameters for the choice (i.e., those associated to siso_or) modules are assigned values separately for each repetition. The hyperparameters associated to each a_fn, b_fn, or c_fn are also assigned values separately.

Simple rearrangements lead to dramatically different search spaces. For example, we get search_space_2 by swapping the nesting order of siso_repeat and siso_or. This search space chooses between a repetition of one, two, or four a_fn, b_fn, or c_fn. Each binary hyperparameter of the repetitions is chosen separately. search_space_3 shows that it is simple to share an hyperparameter across the repetitions by instantiating it outside the function (line 2), and access it on the function (lines 5, 7, and 9). search_space_1, search_space_2, and search_space_3 are encapsulated and can be used as any other search space. search_space_4 shows that we can easily use search_space_1, search_space_2, and search_space_3 in a new search space (compare to search_space_2).

Highly-conditional search spaces can be created through local composition of modules, reducing cognitive load. In our language, substitution modules, basic modules, dependent hyperparameters, and independent hyperparameters are well-defined constructs to encode complex search spaces. For example, a_fn might be complex, creating many modules and hyperparameters, but its definition encapsulates all this. This is one of the greatest advantages of our language, allowing us to easily create new search spaces from existing search spaces. Furthermore, the mapping from instances in the search space to implementations is automatically generated from the search space encoding.

Appendix E Implementation details

This section gives concrete details about our Python language implementation. We refer the reader to https://github.com/negrinho/deep_architect for additional code and documentation.

E.1 Supporting new domains

We only need to extend Module class to support basic modules in the new domain. We start with the common implementation of Module (see Figure 20) for both basic and substitution modules and then cover its extension to support Keras basic modules (see Figure 21).

@ifdisplaystyle
1 class Module(Addressable):
2
3     def __init__(self, scope=None, name=None):
4         scope = scope if scope is not None else Scope.default_scope
5         name = scope.get_unused_name(’.’.join(
6             [’M’, (name if name is not None else self._get_base_name()) + ’-’]))
7         Addressable.__init__(self, scope, name)
8
9         self.inputs = OrderedDict()
10         self.outputs = OrderedDict()
11         self.hyperps = OrderedDict()
12         self._is_compiled = False
13
14     def _register_input(self, name):
15         assert name not in self.inputs
16         self.inputs[name] = Input(self, self.scope, name)
17
18     def _register_output(self, name):
19         assert name not in self.outputs
20         self.outputs[name] = Output(self, self.scope, name)
21
22     def _register_hyperparameter(self, name, h):
23         assert isinstance(h, Hyperparameter) and name not in self.hyperps
24         self.hyperps[name] = h
25         h._register_module(self)
26
27     def _register(self, input_names, output_names, name_to_hyperp):
28         for name in input_names:
29             self._register_input(name)
30         for name in output_names:
31             self._register_output(name)
32         for name in sorted(name_to_hyperp):
33             self._register_hyperparameter(name, name_to_hyperp[name])
34
35     def _get_input_values(self):
36         return {name: ix.val for name, ix in iteritems(self.inputs)}
37
38     def _get_hyperp_values(self):
39         return {name: h.get_value() for name, h in iteritems(self.hyperps)}
40
41     def _set_output_values(self, output_name_to_val):
42         for name, val in iteritems(output_name_to_val):
43             self.outputs[name].val = val
44
45     def get_io(self):
46         return self.inputs, self.outputs
47
48     def get_hyperps(self):
49         return self.hyperps
50
51     def _update(self):
52         """Called when an hyperparameter that the module depends on is set."""
53         raise NotImplementedError
54
55     def _compile(self):
56         raise NotImplementedError
57
58     def _forward(self):
59         raise NotImplementedError
60
61     def forward(self):
62         if not self._is_compiled:
63             self._compile()
64             self._is_compiled = True
65         self._forward()
\lst
Figure 20: Module class used to implement both basic and substitution modules. _register_input, _register_output, _register_hyperparameter, _register, _get_hyperp_values, get_io and get_hyperps are used by both basic and substitution modules. _get_input_values, _set_output_values, _compile, _forward, and forward are used only by basic modules. _update is used only by substitution modules.

General module class

The complete implementation of Module is shown in Figure 20. Module supports the implementations of both basic modules and substitution modules. There are three types of functions in Module in Figure 20: those that are used by both basic and substitution modules (_register_input, _register_output, _register_hyperparameter, _register, _get_hyperp_values, get_io and get_hyperps); those that are used just by basic modules (_get_input_values, _set_output_values, _compile, _forward, and forward); those are used just by substitution modules (_update). We will mainly discuss its extension for basic modules as substitution modules are domain-independent (e.g., there are no domain-specific components in the substitution modules in Figure 13 and in cell in Figure 16).

Supporting basic modules in a domain relies on two functions: _compile and _forward. These functions help us map an architecture to its implementation in deep learning (slightly different functions might be necessary for other domains). forward shows how _compile and _forward are used during graph instantiation in a terminal search space. See Figure 22 for the iteration over the graph in topological ordering (determined by determine_module_eval_seq), and evaluates the forward calls in turn for the modules in the graph leading to its unconnected outputs.

_register_input, _register_output, _register_hyperparameter, and _register are used to describe the inputs and outputs of the module (i.e., _register_input and _register_output), and to associate hyperparameters to its properties (i.e., _register_hyperparameter). _register aggregates the first three functions into one. _get_hyperp_values, _get_input_values, and _set_output_values are used in _forward (see left of Figure 21. These are used in each basic module, once in a terminal search space, to retrieve its hyperparameter values (_get_hyperp_values) and its input values (_get_input_values) and to write the results of its local computation to its outputs (_set_output_values). Finally, get_io retrieves the dictionaries mapping names to inputs and outputs (these correspond to σ(m),i:S(m),iI(m) and σ(m),o:S(m),oO(m), respectively, described in Section 6). Most inputs are named in if there is a single input and in0, in1, and so on if there is more than one. Similarly, for outputs, we have out for a single output, and out0, out1, and so if there are multiple outputs. This is often seen when connecting search spaces, e.g., lines 15 to 28 in right of Figure 15. In Figure 15, we redefine σi and σo (in line 30 to line 34) to have appropriate names for the LSTM cell, but often, if possible, we just use σ(m),i and σ(m),o for σi and σo respectively, e.g., in siso_repeat and siso_combine in Figure 13.

_update is used in substitution modules (not shown in Figure 20): for a substitution module, it checks if all its hyperparameters have been assigned values and does the substitution (i.e., calls its substitution function to create a search space that takes the place of the substitution module; e.g., see frames a, b, and c of Figure 5 for a pictorial representation, and Figure 13 for implementations of substitution modules). In the examples of Figure 13, substitution_fn returns the search space to replace the substitution module with in the form of a dictionary of inputs and a dictionary of outputs (corresponding to σi and σo on line 12 of Algorithm 1). The substitution modules that we considered can be implemented with the helper in Figure 14 (e.g., see the examples in Figure 13).

In the signature of __init__ for Module, scope is a namespace used to register a module with a unique name and name is the prefix used to generate the unique name. Hyperparameters also have a unique name generated in the same way. Figure 5 shows this in how the modules and hyperparameters are named, e.g., in frame a, Conv2D-1 results from generating a unique identifier for name Conv2D (this is also captured in the use of _get_name in the examples in Figure 12 and Figure 13). When scope is not mentioned explicitly, a default global scope is used (e.g., scope is optional in Figure 20).

Extending the module class for a domain (e.g., Keras)

Figure 21 (left) shows the extension of Module to deal with basic modules in Keras. KerasModule is the extension of Module. keras_module is a convenience function that instantiates a KerasModule and returns its dictionary of local names to inputs and outputs. siso_keras_module is the same as keras_module but uses default names in and out for a single-input single-output module, which saves the expert the trouble of explicitly naming inputs and outputs for this common case. Finally, siso_keras_module_from_keras_layer_fn reduces the effort of creating basic modules from Keras functions (i.e., the function can be passed directly creating compile_fn beforehand). These functions are analogous for different deep learning frameworks, e.g., see the example usage of siso_tensorflow_module in Figure 12.

The most general helper, keras_module works by providing the local names for the inputs (input_names) and outputs (output_names), the dictionary mapping local names to hyperparameters (name_to_hyperp), and the compilation function (compile_fn), which corresponds to the _compile_fn function of the module. Calling _compile_fn returns a function (corresponding to _forward for a module, e.g., see Figure 12).

@ifdisplaystyle 1 import deep_architect.core as co 2 3 class KerasModule(co.Module): 4 5     def __init__(self, 6                     name, 7                     compile_fn, 8                     name_to_hyperp, 9                     input_names, 10                     output_names, 11                     scope=None): 12         co.Module.__init__(self, scope, name) 13         self._register(input_names, output_names, 14             name_to_hyperp) 15         self._compile_fn = compile_fn 16 17     def _compile(self): 18         input_name_to_val = self._get_input_values() 19         hyperp_name_to_val = self._get_hyperp_values() 20         self._fn = self._compile_fn( 21             input_name_to_val, hyperp_name_to_val) 22 23     def _forward(self): 24         input_name_to_val = self._get_input_values() 25         output_name_to_val = self._fn(input_name_to_val) 26         self._set_output_values(output_name_to_val) 27 28     def _update(self): 29         pass \lst @ifdisplaystyle 1 def keras_module(name, 2                  compile_fn, 3                  name_to_hyperp, 4                  input_names, 5                  output_names, 6                  scope=None): 7     return KerasModule(name, compile_fn, 8         name_to_hyperp, input_names, 9         output_names, scope).get_io() 10 11 12 def siso_keras_module(name, compile_fn, 13         name_to_hyperp, scope=None): 14     return KerasModule(name, compile_fn, 15         name_to_hyperp, [’in’], [’out’], 16                         scope).get_io() 17 18 19 def siso_keras_module_from_keras_layer_fn( 20         layer_fn, name_to_hyperp, 21         scope=None, name=None): 22 23     def compile_fn(di, dh): 24         m = layer_fn(**dh) 25 26         def forward_fn(di): 27             return {"out": m(di["in"])} 28 29         return forward_fn 30 31     if name is None: 32         name = layer_fn.__name__ 33 34     return siso_keras_module(name, 35         compile_fn, name_to_hyperp, scope) \lst
Figure 21: Left: Complete extension of the Module class (see Figure 20 for supporting Keras basic modules. Right: Convenience functions to reduce the effort of wrapping Keras operations into basic modules for common cases. See Figure 12 for examples of how they are used.
@ifdisplaystyle
1 def forward(input_to_val, _module_seq=None):
2     if _module_seq is None:
3         _module_seq = determine_module_eval_seq(input_to_val.keys())
4
5     for ix, val in iteritems(input_to_val):
6         ix.val = val
7
8     for m in _module_seq:
9         m.forward()
10         for ox in itervalues(m.outputs):
11             for ix in ox.get_connected_inputs():
12                 ix.val = ox.val
\lst
Figure 22: Generating the implementation of the architecture in a terminal search space G (e.g., the one in frame d of Figure 5). Compare to Algorithm 4: input_to_val corresponds to the x(i) for iIu(G); determine_module_eval_seq corresponds to OrderedTopologically in line 1 of Algorithm 4; Remaining code corresponds to the traversal of the modules according to this ordering, evaluation of their local computations, and propagation of results from outputs to inputs.

E.2 Implementing a search algorithm

@ifdisplaystyle
1 def random_specify_hyperparameter(hyperp):
2     assert not hyperp.has_value_assigned()
3
4     if isinstance(hyperp, hp.Discrete):
5         v = hyperp.vs[np.random.randint(len(hyperp.vs))]
6         hyperp.assign_value(v)
7     else:
8         raise ValueError
9     return v
10
11 def random_specify(outputs):
12     hyperp_value_lst = []
13     for h in co.unassigned_independent_hyperparameter_iterator(outputs):
14         v = random_specify_hyperparameter(h)
15         hyperp_value_lst.append(v)
16     return hyperp_value_lst
17
18 class RandomSearcher(Searcher):
19     def __init__(self, search_space_fn):
20         Searcher.__init__(self, search_space_fn)
21
22     def sample(self):
23         inputs, outputs = self.search_space_fn()
24         vs = random_specify(outputs)
25         return inputs, outputs, vs, {}
26
27     def update(self, val, searcher_eval_token):
28         pass
\lst
Figure 23: Implementation of random search in our language implementation. sample assigns values to all the independent hyperparameters in the search space, leading to an architecture that can be evaluated. update incorporates the results of evaluating an architecture into the state of the searcher, allowing it to use this information in the next call to sample.

Figure 23 shows random search in our implementation. random_specify_hyperparameter assigns a value uniformly at random to an independent hyperparameter. random_specify assigns all unassigned independent hyperparameters in the search space until reaching a terminal search space (each assignment leads to a search space transition; see Figure 5). RandomSearcher encapsulates the behavior of the searcher through two main functions: sample and update. sample samples an architecture from the search space, which returns inputs and outputs for the sampled terminal search space, the sequence of value assignments that led to the sampled terminal search space, and a searcher_eval_token that allows the searcher to identify the sampled terminal search space when the evaluation results are passed back to the searcher through a call to update. update incorporates the evaluation results (e.g., validation accuracy) of a sampled architecture into the state of the searcher, allowing it to use this information in the next call to sample. For random search, update is a no-op. __init__ takes the function returning a search space (e.g., search_space in Figure 16) from which architectures are to be drawn from and any other arguments that the searcher may need (e.g., exploration term in MCTS). To implement a new searcher, Searcher needs to be extended by implementing sample and update for the desired search algorithm. unassigned_independent_hyperparameter_iterator provides ordered iteration over the independent hyperparameters of the search space. The role of the search algorithm is to pick values for each of these hyperparameters, leading to a terminal space. Compare to Algorithm 3. search_space_fn returns the dictionaries of inputs and outputs for the initial state of the search space (analogous to the search space in frame a in Figure 5).

Appendix F Additional experimental results

We present the full validation and test results for both the search space experiments (Table 3) and the search algorithm experiments (Table 4). For each search space, we performed a grid search over the learning rate with values in {0.1,0.05,0.025,0.01,0.005,0.001} and an L2 penalty with values in {0.0001,0.0003,0.0005} for the architecture with the highest validation accuracy Each evaluation in the grid search was trained for 600 epochs with SGD with momentum of 0.9 and a cosine learning rate schedule We did a similar grid search for each search algorithm.

Table 3: Results for the search space experiments A grid search was performed on the best architecture from the search phase Each evaluation in the grid search was trained for 600 epochs
Search Space
Validation Accuracy
@ 25 epochs
Validation Accuracy
@ 600 epochs
Test Accuracy
@ 600 epochs
Number of
Parameters
Genetic [26] 79.03 91.13 90.07 9.4M
Flat [16] 80.69 93.70 93.58 11.3M
Nasbench [27] 87.66 95.08 94.59 2.6M
Nasnet [29] 82.35 94.56 93.77 4.5M
Table 4: Results for the search algorithm experiments A grid search was performed on the best architecture from the search phase, each trained to 600 epochs
Search algorithm Run
Validation
Accuracy
@ 25 epochs
Validation
Accuracy
@ 600 epochs
Test
Accuracy
@ 600 epochs
Random 1 77.58 92.61 92.38
2 79.09 91.93 91.30
3 81.26 92.35 91.16
Mean 79.31±1.85 92.29±0.34 91.61±0.67
MCTS [6] 1 78.68 91.97 91.33
2 78.65 91.59 91.47
3 78.65 92.69 91.55
Mean 78.66±0.02 92.08±0.56 91.45±0.11
SMBO [17] 1 77.93 93.62 92.92
2 81.80 93.05 92.03
3 82.73 91.89 90.86
Mean 80.82±2.54 92.85±0.88 91.93±1.03
Regularized evolution [23] 1 80.99 92.06 90.80
2 81.51 92.49 91.79
3 81.65 92.10 91.39
Mean 81.38±0.35 92.21±0.24 91.32±0.50