Abstract
Attention layers are the core component of transformers, the current state-of-the-art neural network architecture. Alternatives to softmax-based attention are being explored due to its tendency to hinder effective information flow. Even at initialisation, it remains poorly understood why the propagation of signals and gradients through these random networks can be pathological, resulting in issues known as (i) vanishing/exploding gradients and (ii) rank collapse $\textit{in depth}$, i.e. when all tokens converge to a single representation along layers. While rank collapse in depth naturally arises from repeated matrix multiplications – a common pattern across various architectures – we identify an additional and previously unknown challenge unique to softmax attention layers: (iii) rank collapse $\textit{in width}$, which occurs as the context length increases. Using Random Matrix Theory, we conduct a rigorous analysis that uncovers a spectral gap between the two largest singular values of the attention matrix as the cause of (iii), which in turn exacerbates (i) and (ii). Building on this insight, we propose a novel yet simple practical solution to mitigate rank collapse in width by removing the outlier eigenvalue(s). Our theoretical framework offers a fresh perspective on recent practical studies, such as (Ye et al., 2024; Ali et al., 2023), whose ad hoc solutions can now be interpreted as implicit efforts to address the spectral gap issue. This work provides valuable theoretical support for ongoing large-scale empirical research, bringing theory and practice one step closer in the understanding of transformers.
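The spectral gap described above can be illustrated numerically. The following is a minimal sketch under stated assumptions, not the paper's experimental setup: the attention logits are taken to be i.i.d. standard Gaussian (a stand-in for a randomly initialised layer), and the outlier is removed by subtracting the uniform matrix $\frac{1}{T}\mathbf{1}\mathbf{1}^\top$, which is one natural choice because every row-stochastic matrix has the all-ones vector as an eigenvector with eigenvalue 1. The function names (`softmax_rows`, `spectral_gap_ratio`, `remove_uniform_component`, `demo`) are hypothetical and introduced only for this illustration.

```python
import numpy as np


def softmax_rows(logits):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    weights = np.exp(shifted)
    return weights / weights.sum(axis=1, keepdims=True)


def spectral_gap_ratio(matrix):
    """Ratio of the largest to the second-largest singular value."""
    singular_values = np.linalg.svd(matrix, compute_uv=False)  # sorted descending
    return singular_values[0] / singular_values[1]


def remove_uniform_component(attention):
    """Subtract (1/T) * ones from every entry, T being the context length.

    Assumption for this sketch: this is one simple way to remove the outlier
    associated with the all-ones direction of a row-stochastic matrix.
    """
    context_length = attention.shape[1]
    return attention - np.full_like(attention, 1.0 / context_length)


def demo(context_length=512, seed=0):
    """Return the gap ratio before and after removing the uniform component."""
    rng = np.random.default_rng(seed)
    # Assumption: i.i.d. Gaussian logits model attention scores at initialisation.
    logits = rng.standard_normal((context_length, context_length))
    attention = softmax_rows(logits)
    return (
        spectral_gap_ratio(attention),
        spectral_gap_ratio(remove_uniform_component(attention)),
    )


gap_before, gap_after = demo()
print(f"s1/s2 of softmax attention:       {gap_before:.2f}")
print(f"s1/s2 after removing the outlier: {gap_after:.2f}")
```

In this toy setting the largest singular value of the softmax matrix is at least 1, since the matrix maps the normalised all-ones vector to itself, whereas the remaining singular values shrink as the context length grows. Increasing `context_length` should therefore widen the gap before removal, while the ratio after removal stays close to 1.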