Relaxed Attention for Transformer Models

Abstract

The powerful modeling capabilities of all-attention-based transformer architectures often cause overfitting and, for natural language processing tasks, lead to an implicitly learned internal language model in the autoregressive transformer decoder, which complicates the integration of external language models. In this paper, we explore relaxed attention, a simple and easy-to-implement smoothing of the attention weights, yielding a two-fold improvement to the general transformer architecture: First, relaxed attention provides regularization when applied to the self-attention layers in the encoder. Second, we show that it naturally supports the integration of an external language model, as it suppresses the implicitly learned internal language model by relaxing the cross attention in the decoder. We demonstrate the benefit of relaxed attention across several tasks, with clear improvement in combination with recent benchmark approaches. Specifically, we exceed the former state-of-the-art performance of 26.90% word error rate on the largest public lip-reading benchmark, LRS3, with a word error rate of 26.31%, and we achieve a top-performing BLEU score of 37.67 on the IWSLT14 (DE$\rightarrow$EN) machine translation task without external language models and with virtually no additional model parameters. Code and models will be made publicly available.
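
The abstract describes relaxed attention only as a smoothing of the attention weights and does not give a formula. As a minimal sketch, one plausible reading is a linear interpolation between the softmax attention weights and a uniform distribution over the keys. The interpolation form, the function names, and the coefficient `gamma` below are assumptions made for illustration and are not taken from the text above.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - np.max(x, axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=axis, keepdims=True)


def relaxed_attention_weights(scores, gamma=0.1):
    """Smooth attention weights toward a uniform distribution.

    ASSUMPTION: "relaxation" is modeled here as
        (1 - gamma) * softmax(scores) + gamma / num_keys,
    which is one possible smoothing, not necessarily the paper's exact one.
    `scores` has shape (..., num_queries, num_keys).
    """
    weights = softmax(scores, axis=-1)
    num_keys = scores.shape[-1]
    uniform = 1.0 / num_keys  # uniform attention over all keys
    return (1.0 - gamma) * weights + gamma * uniform


# Toy example: two queries attending over three keys.
scores = np.array([[4.0, 0.0, 0.0],
                   [0.0, 2.0, 1.0]])
standard = softmax(scores)
relaxed = relaxed_attention_weights(scores, gamma=0.2)
```

Under this assumption each row of `relaxed` still sums to one, the peak weight in each row is lowered, and setting `gamma=0` recovers standard attention. The smoothing itself introduces no trainable parameters, which is in line with the abstract's remark about virtually no additional model parameters.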
