Stick-breaking Attention

Abstract

The self-attention mechanism traditionally relies on the softmax operator, necessitating positional embeddings like RoPE, or position biases, to account for token order. However, current methods still face length generalisation challenges. We propose an alternative attention mechanism based on the stick-breaking process: for each token before the current one, we determine a break point $\beta_{i,j}$, which represents the proportion of the remaining stick to allocate to that token. We repeat the process until the stick is fully allocated, resulting in a sequence of attention weights. This process naturally incorporates recency bias, which has linguistic motivations for grammar parsing (Shen et al., 2017). We study the implications of replacing the conventional softmax-based attention mechanism with stick-breaking attention. We then discuss the implementation of numerically stable stick-breaking attention and adapt Flash Attention to accommodate this mechanism. When used as a drop-in replacement for current softmax+RoPE attention systems, we find that stick-breaking attention performs competitively with current methods on length generalisation and downstream tasks. Stick-breaking also performs well at length generalisation, allowing a model trained with a $2^{11}$ context window to perform well at $2^{14}$ with perplexity improvements.
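
To make the stick-breaking process concrete, the following is a minimal sketch for a single query position. It rests on assumptions that the abstract does not state: each break point is taken to be the logistic sigmoid of a query-key logit, and the stick is broken starting from the most recent token and moving backwards, which is one way to obtain the recency bias described above. The function name `stick_breaking_weights` and the `logits` argument are hypothetical, and this naive loop is not the numerically stable or Flash Attention formulation the paper refers to.

```python
import math


def stick_breaking_weights(logits):
    """Naive stick-breaking weights for one query over its preceding tokens.

    `logits` holds one (hypothetical) query-key score per preceding token,
    ordered oldest first, so logits[-1] belongs to the most recent token.
    Returns the list of weights and the unallocated remainder of the stick.
    """
    weights = [0.0] * len(logits)
    remaining = 1.0  # portion of the stick not yet allocated

    # Assumption: break the stick from the most recent token backwards.
    for k in range(len(logits) - 1, -1, -1):
        # Assumption: the break point is a sigmoid of the logit, so it lies in (0, 1).
        beta = 1.0 / (1.0 + math.exp(-logits[k]))
        weights[k] = beta * remaining  # this token takes a share of what is left
        remaining *= 1.0 - beta        # the rest is passed on to older tokens

    return weights, remaining


weights, leftover = stick_breaking_weights([0.0, 0.0, 0.0])
print(weights, leftover)  # [0.125, 0.25, 0.5] 0.125
```

With equal logits, the most recent token receives the largest weight and older tokens receive geometrically smaller shares. In this sketch the weights plus the leftover sum to one, so the weights alone need not sum to one when the context is finite.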
