How Smooth Is Attention?

Abstract

Self-attention and masked self-attention are at the heart of Transformers' outstanding success. Still, our mathematical understanding of attention, in particular of its Lipschitz properties (which are key when it comes to analyzing robustness and expressive power), is incomplete. We provide a detailed study of the Lipschitz constant of self-attention in several practical scenarios, discussing the impact of the sequence length $n$ and layer normalization on the local Lipschitz constant of both unmasked and masked self-attention. In particular, we show that for inputs of length $n$ in any compact set, the Lipschitz constant of self-attention is bounded by $\sqrt{n}$ up to a constant factor and that this bound is tight for reasonable sequence lengths. When the sequence length $n$ is too large for the previous bound to be tight, which we refer to as the mean-field regime, we provide an upper bound and a matching lower bound which are independent of $n$. Our mean-field framework for masked self-attention is novel and of independent interest. Our experiments on pretrained and randomly initialized BERT and GPT-2 support our theoretical findings.
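To make the notion of a local Lipschitz constant concrete, the sketch below estimates it numerically for a single-head self-attention layer at one input $X$, as the spectral norm of the Jacobian with respect to the flattened input. It is an illustration only, not the experimental procedure of the paper: the function names, the random weights `Wq`, `Wk`, `Wv`, the dimensions `n = 8`, `d = 4`, the $1/\sqrt{d}$ score scaling, and the finite-difference Jacobian are all assumptions made for this example.

```python
import numpy as np

def softmax(z, axis=-1):
    # Numerically stable softmax; rows with -inf entries are handled correctly.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv, masked=False):
    # X has shape (n, d): n tokens of dimension d (hypothetical single-head layer).
    n = X.shape[0]
    scores = (X @ Wq) @ (X @ Wk).T / np.sqrt(Wq.shape[1])
    if masked:
        # Causal mask: token i may only attend to tokens j <= i.
        future = np.triu(np.ones((n, n), dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    return softmax(scores, axis=-1) @ (X @ Wv)

def local_lipschitz(f, X, eps=1e-5):
    # Spectral norm of a central finite-difference Jacobian of f at X.
    # For differentiable f this approximates the local Lipschitz constant
    # at X with respect to the Frobenius norm on inputs and outputs.
    x0 = X.ravel()
    J = np.empty((f(X).size, x0.size))
    for j in range(x0.size):
        step = np.zeros_like(x0)
        step[j] = eps
        plus = f((x0 + step).reshape(X.shape))
        minus = f((x0 - step).reshape(X.shape))
        J[:, j] = (plus - minus).ravel() / (2 * eps)
    return np.linalg.norm(J, 2)

# Assumed toy setup: random input and random weights.
rng = np.random.default_rng(0)
n, d = 8, 4
X = rng.standard_normal((n, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))

lip_unmasked = local_lipschitz(lambda Z: self_attention(Z, Wq, Wk, Wv), X)
lip_masked = local_lipschitz(lambda Z: self_attention(Z, Wq, Wk, Wv, masked=True), X)
print(f"unmasked: {lip_unmasked:.3f}  masked: {lip_masked:.3f}")
```

Repeating the estimate for increasing `n`, with inputs kept in a fixed bounded set, is one way to observe how the local constant varies with sequence length. A single random draw like this one says nothing about worst-case inputs, which is what the bounds in the abstract concern.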
