Pay Less Attention with Lightweight and Dynamic Convolutions

Abstract

Self-attention is a useful mechanism to build generative models for language and images. It determines the importance of context elements by comparing each element to the current time step. In this paper, we show that a very lightweight convolution can perform competitively with the best reported self-attention results. Next, we introduce dynamic convolutions, which are simpler and more efficient than self-attention. We predict separate convolution kernels based solely on the current time step in order to determine the importance of context elements. The number of operations required by this approach scales linearly in the input length, whereas self-attention is quadratic. Experiments on large-scale machine translation, language modeling, and abstractive summarization show that dynamic convolutions improve over strong self-attention models. On the WMT'14 English-German test set, dynamic convolutions achieve a new state of the art of 29.7 BLEU.
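
To make the idea in the abstract concrete, here is a minimal NumPy sketch of a convolution whose kernel is predicted from the current time step alone. It is an illustration under stated assumptions, not the authors' implementation: the function name `dynamic_conv`, the projection matrix `w_proj`, the softmax normalization of the kernel, the causal (left-padded) window, and the sharing of one kernel across all channels are all choices made for this example. Each output position touches only `kernel_size` inputs, so the total work is proportional to T * kernel_size * d, which is linear in the sequence length T.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def dynamic_conv(x, w_proj, kernel_size):
    """Hypothetical sketch of a dynamic convolution.

    x:      (T, d) input sequence of T time steps with d channels.
    w_proj: (d, kernel_size) projection that predicts a kernel from one time step
            (an assumed parameterization for this example).
    Returns a (T, d) array.
    """
    T, d = x.shape
    # One kernel per time step, computed from that time step only.
    # Normalizing with softmax is an assumption of this sketch.
    kernels = softmax(x @ w_proj, axis=-1)            # (T, kernel_size)
    # Left-pad with zeros so position t sees x[t-kernel_size+1 .. t].
    padded = np.concatenate([np.zeros((kernel_size - 1, d)), x], axis=0)
    out = np.zeros((T, d))
    for t in range(T):
        window = padded[t:t + kernel_size]            # (kernel_size, d)
        out[t] = kernels[t] @ window                  # weighted sum over the window
    return out

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal((10, 8))
    w_proj = rng.standard_normal((8, 3))
    print(dynamic_conv(x, w_proj, kernel_size=3).shape)
```

In contrast to self-attention, no position is compared against every other position: the weights at step t depend only on `x[t]`, and the window has a fixed width regardless of how long the sequence is.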
