The Evolved Transformer

Abstract

Recent works have highlighted the strengths of the Transformer architecture for dealing with sequence tasks. At the same time, neural architecture search has advanced to the point where it can outperform human-designed models. The goal of this work is to use architecture search to find a better Transformer architecture. We first construct a large search space inspired by the recent advances in feed-forward sequential models and then run evolutionary architecture search, seeding our initial population with the Transformer. To effectively run this search on the computationally expensive WMT 2014 English-German translation task, we develop the progressive dynamic hurdles method, which allows us to dynamically allocate more resources to more promising candidate models. The architecture found in our experiments - the Evolved Transformer - demonstrates consistent improvement over the Transformer on four well-established language tasks: WMT 2014 English-German, WMT 2014 English-French, WMT 2014 English-Czech and LM1B. At big model size, the Evolved Transformer is twice as efficient as the Transformer in FLOPS without loss in quality. At a much smaller - mobile-friendly - model size of ~7M parameters, the Evolved Transformer outperforms the Transformer by 0.7 BLEU on WMT'14 English-German.
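The abstract describes progressive dynamic hurdles only as a way to give more resources to more promising candidates, so the following is a simplified sketch of that general idea, not the procedure used in the paper. It assumes a batch setting in which every candidate receives a small training budget, and only candidates whose fitness meets a hurdle receive the next increment. The hurdle rule (mean fitness of the models trained in the current stage), the function `allocate_with_hurdles`, the toy fitness function, and the budget values are all hypothetical choices made for illustration.

```python
from typing import Callable, Dict, Sequence


def allocate_with_hurdles(
    candidates: Sequence[str],
    evaluate: Callable[[str, int], float],
    stage_steps: Sequence[int],
) -> Dict[str, Dict[str, float]]:
    """Give each extra training increment only to candidates that pass a hurdle.

    `evaluate(name, total_steps)` is a hypothetical stand-in for training a
    candidate for `total_steps` steps and returning its fitness.
    """
    results = {name: {"steps": 0, "fitness": float("-inf")} for name in candidates}
    survivors = list(candidates)

    for extra_steps in stage_steps:
        if not survivors:
            break  # nothing left to train
        # Spend this stage's budget on the surviving candidates only.
        for name in survivors:
            results[name]["steps"] += extra_steps
            results[name]["fitness"] = evaluate(name, results[name]["steps"])
        # Assumed hurdle: mean fitness of the models trained in this stage.
        hurdle = sum(results[n]["fitness"] for n in survivors) / len(survivors)
        survivors = [n for n in survivors if results[n]["fitness"] >= hurdle]

    return results


# Toy fitness: each candidate approaches a fixed quality as steps increase.
toy_quality = {"a": 0.2, "b": 0.5, "c": 0.8, "d": 0.9}


def toy_evaluate(name: str, total_steps: int) -> float:
    return toy_quality[name] * total_steps / (total_steps + 10)


report = allocate_with_hurdles(list(toy_quality), toy_evaluate, [10, 20, 40])
for name, info in report.items():
    print(name, info["steps"], round(info["fitness"], 4))
```

In this toy run the weaker candidates stop after the first, cheapest stage, while the strongest candidate accumulates the full budget, which is the kind of uneven allocation the abstract attributes to the method.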
