Abstract
Attention-based models have been a key element of many recent breakthroughs in deep learning. Two key components of Attention are the structure of its input (which consists of keys, values and queries) and the computations by which these three are combined. In this paper we explore the space of models that share this input structure but are not restricted to the computations of Attention. We refer to this space as Keys-Values-Queries (KVQ) Space. Our goal is to determine whether there are any other stackable models in KVQ Space that Attention cannot efficiently approximate, that we can implement with our current deep learning toolbox, and that solve problems that are interesting to the community. Perhaps surprisingly, the solution to the standard least squares problem satisfies these properties. A neural network module that is able to compute this solution not only enriches the set of computations that a neural network can represent but is also provably a strict generalisation of Linear Attention. Even more surprisingly, the computational complexity of this module is exactly the same as that of Attention, making it a suitable drop-in replacement. With this novel connection between classical machine learning (least squares) and modern deep learning (Attention) established, we justify a variation of our model which generalises regular Attention in the same way. Both new modules are put to the test on a wide spectrum of tasks, ranging from few-shot learning to policy distillation, that confirm their real-world applicability.