Abstract
The time complexity of the standard attention mechanism in transformers scales quadratically with sequence length. We propose a probabilistic framework for attention, enabling us to derive a novel low-rank linear re-parameterisation of both bidirectional and causal cases, based on defining a latent variable model. Our method can be seamlessly integrated as a drop-in replacement for the standard attention mechanism. Additionally, this framework provides a natural extension for combining local standard attention with our global linear attention. This approach allows us to extend the context length of existing large pre-trained models with only a few additional training steps. The resulting ``Latte Transformer'' achieves performance comparable to standard attention and other state-of-the-art models, while maintaining linear time and memory complexity, along with constant-time next-token prediction during inference.
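To illustrate how a latent variable can make attention linear in sequence length, the following is a minimal NumPy sketch of the bidirectional case only. It assumes the attention weights factorise as p(s|t) = sum_l p(s|l) p(l|t) over L latent states, a form the abstract does not spell out. The function names, argument names, and shapes are hypothetical and chosen for illustration; this is not the authors' implementation, and the causal case and constant-time inference are not shown.

```python
import numpy as np


def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def latent_attention(q_logits, k_logits, v):
    """Bidirectional attention routed through L latent states (illustrative).

    q_logits: (T, L) logits for p(l | t), one row per query position t.
    k_logits: (T, L) logits for p(s | l), one column per latent state l.
    v:        (T, D) value vectors.

    Assumed factorisation: p(s | t) = sum_l p(s | l) * p(l | t).
    """
    p_l_given_t = softmax(q_logits, axis=1)  # (T, L): each row sums to 1 over latents
    p_s_given_l = softmax(k_logits, axis=0)  # (T, L): each column sums to 1 over positions
    # Summarise the whole sequence once per latent state: O(T * L * D).
    latent_summary = p_s_given_l.T @ v       # (L, D)
    # Mix the latent summaries for every position: O(T * L * D).
    return p_l_given_t @ latent_summary      # (T, D)
```

Under this assumption the implied T x T attention matrix is the product of a (T, L) and an (L, T) matrix, so its rank is at most L, and it is never materialised. Time and memory therefore grow linearly with T for a fixed number of latent states L.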