When Attention Sink Emerges in Language Models: An Empirical View

Abstract

Language models (LMs) assign significant attention to the first token, even if it is not semantically important, a phenomenon known as attention sink. This phenomenon has been widely exploited in applications such as streaming/long-context generation, KV cache optimization, inference acceleration, model quantization, and others. Despite its widespread use, a deep understanding of attention sink in LMs is still lacking. In this work, we first demonstrate that attention sinks exist universally in LMs with various inputs, even in small models. Furthermore, attention sink is observed to emerge during LM pre-training, motivating us to investigate how optimization, data distribution, loss function, and model architecture in LM pre-training influence its emergence. We highlight that attention sink emerges after effective optimization on sufficient training data. The sink position is highly correlated with the loss function and data distribution. Most importantly, we find that attention sink acts more like key biases, storing extra attention scores, which could be non-informative and not contribute to the value computation. We also observe that this phenomenon (at least partially) stems from tokens' inner dependence on attention scores as a result of softmax normalization. After relaxing such dependence by replacing softmax attention with other attention operations, such as sigmoid attention without normalization, attention sinks do not emerge in LMs up to 1B parameters. The code is available at https://github.com/sail-sg/Attention-Sink.
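The dependence introduced by softmax normalization can be illustrated with a minimal NumPy sketch. It is not taken from the paper's repository: the function names (`causal_scores`, `softmax_attention`, `sigmoid_attention`) and the random inputs are assumptions made for illustration, and the example shows only the normalization coupling, not the emergence of a sink during training.

```python
import numpy as np

def causal_scores(q, k):
    """Scaled dot-product scores and a boolean mask marking future positions."""
    t, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    future = np.triu(np.ones((t, t), dtype=bool), k=1)
    return scores, future

def softmax_attention(q, k):
    """Causal softmax weights: each row sums to 1, so keys compete for mass."""
    scores, future = causal_scores(q, k)
    scores = np.where(future, -np.inf, scores)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)

def sigmoid_attention(q, k):
    """Causal sigmoid weights without normalization: each weight depends only
    on its own score, so rows need not sum to 1."""
    scores, future = causal_scores(q, k)
    weights = 1.0 / (1.0 + np.exp(-scores))
    return np.where(future, 0.0, weights)

# Illustrative random queries and keys (hypothetical shapes: 5 tokens, dim 8).
rng = np.random.default_rng(0)
q = rng.normal(size=(5, 8))
k = rng.normal(size=(5, 8))

# Make the first key much better aligned with the last query and compare
# how the weight on a different key (position 1) reacts.
k_boosted = k.copy()
k_boosted[0] += 5.0 * q[-1]

soft_before = softmax_attention(q, k)[-1, 1]
soft_after = softmax_attention(q, k_boosted)[-1, 1]
sig_before = sigmoid_attention(q, k)[-1, 1]
sig_after = sigmoid_attention(q, k_boosted)[-1, 1]

print("softmax weight on token 1:", soft_before, "->", soft_after)
print("sigmoid weight on token 1:", sig_before, "->", sig_after)
```

Under softmax, raising the first token's score necessarily lowers the weight on every other visible token in that row, because the row must sum to one. Under unnormalized sigmoid attention the weight on token 1 is unchanged, which is the relaxed dependence the abstract refers to.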
