Scalable-Softmax Is Superior for Attention

Abstract

The maximum element of the vector output by the Softmax function approaches zero as the input vector size increases. Transformer-based language models rely on Softmax to compute attention scores, causing the attention distribution to flatten as the context size grows. This reduces the model's ability to prioritize key information effectively and potentially limits its length generalization. To address this problem, we propose Scalable-Softmax (SSMax), which replaces Softmax in scenarios where the input vector size varies. SSMax can be seamlessly integrated into existing Transformer-based architectures. Experimental results in language modeling show that models using SSMax not only achieve faster loss reduction during pretraining but also significantly improve performance in long contexts and key information retrieval. Furthermore, an analysis of attention scores reveals that SSMax enables the model to focus attention on key information even in long contexts. Additionally, although models that use SSMax from the beginning of pretraining achieve better length generalization, those that have already started pretraining can still gain some of this ability by replacing Softmax in the attention layers with SSMax, either during or after pretraining.
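
The flattening effect described above can be checked numerically. The sketch below is an illustration, not the authors' implementation: it places one "key" logit of 1.0 among n - 1 zeros and reports the largest Softmax weight as n grows. For contrast it also evaluates a length-aware variant that multiplies the logits by s * log(n) before Softmax. That form, the function name `log_scaled_softmax`, and the fixed value s = 1.0 are assumptions made for this example, since the abstract does not state the SSMax formula.

```python
import math


def softmax(z):
    """Numerically stable Softmax over a list of floats."""
    m = max(z)
    exps = [math.exp(x - m) for x in z]
    total = sum(exps)
    return [e / total for e in exps]


def log_scaled_softmax(z, s=1.0):
    """Hypothetical length-aware variant: Softmax applied to s * log(n) * z.

    Assumption for illustration only; the abstract does not define SSMax,
    and s is fixed here rather than learned.
    """
    n = len(z)
    return softmax([s * math.log(n) * x for x in z])


def max_weight(n, fn):
    """Largest output weight when one logit is 1.0 and the other n - 1 are 0.0."""
    z = [1.0] + [0.0] * (n - 1)
    return max(fn(z))


if __name__ == "__main__":
    for n in (10, 100, 1_000, 10_000):
        plain = max_weight(n, softmax)
        scaled = max_weight(n, log_scaled_softmax)
        print(f"n={n:>6}  softmax max={plain:.6f}  log-scaled max={scaled:.6f}")
```

With plain Softmax the largest weight in this toy input is e / (e + n - 1), which shrinks toward zero as n grows. Under the assumed log-scaled form with s = 1 it is n / (2n - 1), which stays close to 0.5 regardless of input size.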
