Transformers are Multi-State RNNs

Abstract

Transformers are considered conceptually different from the previous generation of state-of-the-art NLP models, recurrent neural networks (RNNs). In this work, we demonstrate that decoder-only transformers can in fact be conceptualized as infinite multi-state RNNs: an RNN variant with unlimited hidden state size. We further show that pretrained transformers can be converted into $\textit{finite}$ multi-state RNNs by fixing the size of their hidden state. We observe that several existing transformer cache compression techniques can be framed as such conversion policies, and introduce a novel policy, TOVA, which is simpler than these policies. Our experiments with several long-range tasks indicate that TOVA outperforms all other baseline policies, while being nearly on par with the full (infinite) model and using, in some cases, only $\frac{1}{8}$ of the original cache size. Our results indicate that transformer decoder LLMs often behave in practice as RNNs. They also lay out the option of mitigating one of their most painful computational bottlenecks: the size of their cache memory. We publicly release our code at https://github.com/schwartz-lab-NLP/TOVA.
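
The multi-state view described in the abstract can be illustrated with a minimal single-head sketch. It is not taken from the paper or the released code. The function names (`attend`, `decode_step`) and the `max_states` parameter are hypothetical, and the eviction rule used here (drop the cached state that received the lowest attention weight at the current step) is an assumption chosen for illustration; the abstract itself does not specify how TOVA selects states. With `max_states=None` the cache grows at every step (the "infinite" case); with a fixed `max_states` the cache behaves like a bounded hidden state.

```python
import math


def attend(query, keys, values):
    """Single-head scaled dot-product attention for one query vector."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    # Numerically stable softmax over the scores.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(values[0])
    out = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return out, weights


def decode_step(query, key, value, cache, max_states=None):
    """One decoding step where the (keys, values) cache plays the role of the state.

    max_states=None -> the cache grows by one entry per step (unbounded state).
    max_states=k    -> after attending, entries are evicted until k remain.
    The eviction rule (lowest attention weight first) is an illustrative assumption.
    """
    keys = cache[0] + [key]      # new lists, so the caller's cache is not mutated
    values = cache[1] + [value]
    out, weights = attend(query, keys, values)

    if max_states is not None:
        while len(keys) > max_states:
            drop = min(range(len(weights)), key=weights.__getitem__)
            for seq in (keys, values, weights):
                del seq[drop]
    return out, (keys, values)
```

A usage sketch: starting from `cache = ([], [])`, calling `decode_step(q, k, v, cache, max_states=3)` once per token keeps at most three key/value pairs, whereas omitting `max_states` keeps one pair per processed token.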
