Universal Transformers

Abstract

Self-attentive feed-forward sequence models have been shown to achieve impressive results on sequence modeling tasks, thereby presenting a compelling alternative to recurrent neural networks (RNNs), which have remained the de facto standard architecture for many sequence modeling problems to date. Despite these successes, however, feed-forward sequence models like the Transformer fail to generalize in many tasks that recurrent models handle with ease (e.g. copying when the string lengths exceed those observed at training time). Moreover, and in contrast to RNNs, the Transformer model is not computationally universal, limiting its theoretical expressivity. In this paper we propose the Universal Transformer, which addresses these practical and theoretical shortcomings, and we show that it leads to improved performance on several tasks. Instead of recurring over the individual symbols of sequences like RNNs, the Universal Transformer repeatedly revises its representations of all symbols in the sequence with each recurrent step. In order to combine information from different parts of a sequence, it employs a self-attention mechanism in every recurrent step. Assuming sufficient memory, its recurrence makes the Universal Transformer computationally universal. We further employ an adaptive computation time (ACT) mechanism to allow the model to dynamically adjust the number of times the representation of each position in a sequence is revised. Beyond saving computation, we show that ACT can improve the accuracy of the model. Our experiments show that on various algorithmic tasks and a diverse set of large-scale language understanding tasks the Universal Transformer generalizes significantly better and outperforms both a vanilla Transformer and an LSTM in machine translation, and achieves a new state of the art on the bAbI linguistic reasoning task and the challenging LAMBADA language modeling task.
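
The recurrence described in the abstract can be illustrated with a small, dependency-free Python sketch. It is not the authors' implementation. It assumes a single attention head with no learned query/key/value projections, a tanh transition with one weight matrix `W` shared across all steps, and a simplified halting rule (a position stops being revised once its accumulated halting score crosses `threshold`) in place of the full ACT formulation. All names (`universal_step`, `universal_transformer_sketch`, `halt_w`) are hypothetical.

```python
import math
import random


def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def self_attention(H):
    # Simplified scaled dot-product attention: every position attends to
    # every position, using the states themselves as queries, keys, values.
    d = len(H[0])
    out = []
    for q in H:
        weights = softmax([dot(q, k) / math.sqrt(d) for k in H])
        out.append([sum(w * v[j] for w, v in zip(weights, H)) for j in range(d)])
    return out


def universal_step(H, W):
    # One recurrent step: mix information across positions with attention,
    # then apply the same transition (shared W) to each position.
    A = self_attention(H)
    new_H = []
    for h, a in zip(H, A):
        mixed = [x + y for x, y in zip(h, a)]  # residual-style sum
        new_H.append([math.tanh(dot(row, mixed)) for row in W])
    return new_H


def universal_transformer_sketch(H, W, halt_w, max_steps=8, threshold=0.99):
    # Revise all positions in parallel, step after step, reusing W each time.
    # Assumption: simplified halting, not the ACT mechanism from the paper.
    H = [list(h) for h in H]  # do not mutate the caller's states
    n = len(H)
    cum = [0.0] * n           # accumulated halting score per position
    steps = [0] * n           # number of revisions applied per position
    halted = [False] * n
    for _ in range(max_steps):
        if all(halted):
            break
        new_H = universal_step(H, W)
        for i in range(n):
            if halted[i]:
                continue      # halted states are kept but stay visible to attention
            H[i] = new_H[i]
            steps[i] += 1
            cum[i] += 1.0 / (1.0 + math.exp(-dot(halt_w, H[i])))
            halted[i] = cum[i] >= threshold
    return H, steps


random.seed(0)
d, n = 4, 5  # toy state size and sequence length
W = [[random.uniform(-1, 1) for _ in range(d)] for _ in range(d)]
halt_w = [random.uniform(-1, 1) for _ in range(d)]
H0 = [[random.uniform(-1, 1) for _ in range(d)] for _ in range(n)]
H_final, steps = universal_transformer_sketch(H0, W, halt_w)
print(steps)  # revisions per position; may differ between positions
```

The point of the sketch is the loop structure: the same parameters are applied at every step (recurrence over depth rather than over sequence positions), attention is recomputed at each step, and the number of revisions is decided per position.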
