Inductive Biases and Variable Creation in Self-Attention Mechanisms

Abstract

Self-attention, an architectural motif designed to model long-range interactions in sequential data, has driven numerous recent breakthroughs in natural language processing and beyond. This work provides a theoretical analysis of the inductive biases of self-attention modules, where our focus is to rigorously establish which functions and long-range dependencies self-attention blocks prefer to represent. Our main result shows that bounded-norm Transformer layers create sparse variables: they can represent sparse functions of the input sequence, with sample complexity scaling only logarithmically with the context length. Furthermore, we propose new experimental protocols to support this analysis and to guide the practice of training Transformers, built around the large body of work on provably learning sparse Boolean functions.
