### Abstract

Large Transformer models routinely achieve state-of-the-art results on a number of tasks, but training these models can be prohibitively costly, especially on long sequences. We introduce two techniques to improve the efficiency of Transformers. First, we replace dot-product attention with one that uses locality-sensitive hashing, changing its complexity from $O(L^2)$ to $O(L\log L)$, where $L$ is the length of the sequence. Second, we use reversible residual layers instead of the standard residuals, which allows storing activations only once in the training process instead of $N$ times, where $N$ is the number of layers. The resulting model, the Reformer, performs on par with Transformer models while being much more memory-efficient and much faster on long sequences.
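
To illustrate why reversible residual layers remove the need to store per-layer activations, here is a minimal sketch. It assumes the usual two-stream coupling form from the reversible residual network literature, $y_1 = x_1 + F(x_2)$ and $y_2 = x_2 + G(y_1)$, which the abstract itself does not spell out. The functions `f` and `g` are hypothetical stand-ins for the attention and feed-forward sublayers, not the model's actual sublayers.

```python
import math

def f(x):
    # Hypothetical stand-in for the attention sublayer (assumption).
    return [math.tanh(v) for v in x]

def g(x):
    # Hypothetical stand-in for the feed-forward sublayer (assumption).
    return [0.5 * v for v in x]

def rev_forward(x1, x2):
    # Assumed coupling form: y1 = x1 + F(x2), y2 = x2 + G(y1).
    y1 = [a + b for a, b in zip(x1, f(x2))]
    y2 = [a + b for a, b in zip(x2, g(y1))]
    return y1, y2

def rev_inverse(y1, y2):
    # Recover the layer inputs from its outputs alone,
    # so the inputs need not be kept in memory for the backward pass.
    x2 = [a - b for a, b in zip(y2, g(y1))]
    x1 = [a - b for a, b in zip(y1, f(x2))]
    return x1, x2

x1, x2 = [0.1, -0.4, 2.0], [1.5, 0.0, -0.7]
y1, y2 = rev_forward(x1, x2)
r1, r2 = rev_inverse(y1, y2)

# Largest reconstruction error across both streams (floating-point noise only).
max_error = max(abs(a - b) for a, b in zip(x1 + x2, r1 + r2))
```

Because each layer's inputs can be recomputed from its outputs, a stack of $N$ such layers only needs the activations of the final layer during training, at the cost of recomputing `f` and `g` once more per layer in the backward pass.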