Test-time regression: a unifying framework for designing sequence models with associative memory

Abstract

Sequences provide a remarkably general way to represent and process information. This powerful abstraction has placed sequence modeling at the center of modern deep learning applications, inspiring numerous architectures from transformers to recurrent networks. While this fragmented development has yielded powerful models, it has left us without a unified framework to understand their fundamental similarities and explain their effectiveness. We present a unifying framework motivated by an empirical observation: effective sequence models must be able to perform associative recall. Our key insight is that memorizing input tokens through an associative memory is equivalent to performing regression at test-time. This regression-memory correspondence provides a framework for deriving sequence models that can perform associative recall, offering a systematic lens to understand seemingly ad-hoc architectural choices. We show numerous recent architectures -- including linear attention models, their gated variants, state-space models, online learners, and softmax attention -- emerge naturally as specific approaches to test-time regression. Each architecture corresponds to three design choices: the relative importance of each association, the regressor function class, and the optimization algorithm. This connection leads to new understanding: we provide theoretical justification for QKNorm in softmax attention, and we motivate higher-order generalizations of softmax attention. Beyond unification, our work unlocks decades of rich statistical tools that can guide future development of more powerful yet principled sequence models.
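
The regression-memory correspondence can be illustrated with a small numerical sketch. The code below is an illustration under stated assumptions and is not taken from the paper: it assumes a linear regressor class, equal weight on every association, and an exact least-squares solve. The names `fit_linear_memory` and `linear_attention_memory` are hypothetical. It fits a memory matrix M so that M k_i is close to v_i for each key-value pair, then recalls a stored value by querying with its key. It also checks a known special case: when the keys are orthonormal, the least-squares memory coincides with the sum of outer products of values and keys, which is the state that unnormalized linear attention accumulates.

```python
import numpy as np


def fit_linear_memory(keys, values):
    """Least-squares memory: M minimizing sum_i ||v_i - M k_i||^2.

    keys: (T, d_k) array, values: (T, d_v) array. Returns M of shape (d_v, d_k).
    lstsq returns the minimum-norm solution when T < d_k.
    """
    weights, *_ = np.linalg.lstsq(keys, values, rcond=None)  # (d_k, d_v)
    return weights.T


def linear_attention_memory(keys, values):
    """Sum of outer products v_i k_i^T, the unnormalized linear attention state."""
    return values.T @ keys  # (d_v, d_k)


rng = np.random.default_rng(0)
T, d_k, d_v = 4, 8, 3  # assumption: fewer associations than key dimensions

# General keys: the regression solution recalls stored values exactly
# because T <= d_k and random keys are linearly independent.
keys = rng.normal(size=(T, d_k))
values = rng.normal(size=(T, d_v))
memory = fit_linear_memory(keys, values)
recalled = memory @ keys[2]  # query with the third stored key

# Orthonormal keys: the outer-product memory matches the regression solution.
q_factor, _ = np.linalg.qr(rng.normal(size=(d_k, T)))
orth_keys = q_factor.T  # rows are orthonormal, shape (T, d_k)
memory_ls = fit_linear_memory(orth_keys, values)
memory_la = linear_attention_memory(orth_keys, values)

print("recall error (least squares):", np.linalg.norm(recalled - values[2]))
print("memory gap (orthonormal keys):", np.linalg.norm(memory_ls - memory_la))
```

With general, non-orthonormal keys the outer-product memory no longer reproduces stored values exactly, which is one way to see why the choice of optimization algorithm matters among the three design choices listed in the abstract.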
