Deconstructing Positional Information: From Attention Logits to Training Biases

Abstract

Positional encodings enable Transformers to incorporate sequential information, yet their theoretical understanding remains limited to two properties: distance attenuation and translation invariance. Because natural language lacks purely positional data, the interplay between positional and semantic information is still underexplored. We address this gap by deconstructing the attention-logit computation and providing a structured analysis of positional encodings, categorizing them into additive and multiplicative forms. The differing properties of these forms lead to distinct mechanisms for capturing positional information. To probe this difference, we design a synthetic task that explicitly requires strong integration of positional and semantic cues. As predicted, multiplicative encodings achieve a clear performance advantage on this task. Moreover, our evaluation reveals a hidden training bias: an information aggregation effect in shallow layers that we term the single-head deposit pattern. Through ablation studies and theoretical analysis, we show that this phenomenon is inherent to multiplicative encodings. These findings deepen the understanding of positional encodings and call for further study of their training dynamics.
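
As a rough illustration of the additive/multiplicative distinction, the attention logit between query position i and key position j can be sketched as below. The notation is an assumption for illustration and is not taken from the paper: q_i and k_j are the query and key vectors, d is the head dimension, b_{i-j} is a scalar bias that depends only on relative position (as in ALiBi-style schemes), and R_m is a position-dependent rotation matrix (as in RoPE-style schemes).

```latex
% Illustrative sketch only; symbols q_i, k_j, d, b_{i-j}, R_m are assumed notation.
\begin{align*}
\text{additive:}\quad
  & a_{ij} = \frac{q_i^\top k_j}{\sqrt{d}} + b_{i-j} \\
\text{multiplicative:}\quad
  & a_{ij} = \frac{(R_i q_i)^\top (R_j k_j)}{\sqrt{d}}
           = \frac{q_i^\top R_{j-i}\, k_j}{\sqrt{d}}
\end{align*}
% The last equality uses R_i^\top R_j = R_{j-i}, which holds for rotations
% whose angle is linear in the position index.
```

In the additive sketch the positional term is added to the semantic score and is independent of the content of q_i and k_j, whereas in the multiplicative sketch position enters inside the inner product and therefore interacts with the content directly.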
