Long-Short Transformer: Efficient Transformers for Language and Vision

Abstract

Transformers have achieved success in both language and vision domains. However, it is prohibitively expensive to scale them to long sequences such as long documents or high-resolution images, because the self-attention mechanism has quadratic time and memory complexity with respect to the input sequence length. In this paper, we propose Long-Short Transformer (Transformer-LS), an efficient self-attention mechanism for modeling long sequences with linear complexity for both language and vision tasks. It aggregates a novel long-range attention with dynamic projection to model distant correlations and a short-term attention to capture fine-grained local correlations. We propose a dual normalization strategy to account for the scale mismatch between the two attention mechanisms. Transformer-LS can be applied to both autoregressive and bidirectional models without additional complexity. Our method outperforms the state-of-the-art models on multiple tasks in language and vision domains, including the Long Range Arena benchmark, autoregressive language modeling, and ImageNet classification. For instance, Transformer-LS achieves 0.97 test BPC on enwik8 using half the number of parameters of the previous method, while being faster and able to handle sequences 3x as long as its full-attention version on the same hardware. On ImageNet, it obtains state-of-the-art results (e.g., a moderately sized 55.8M-parameter model trained solely on 224x224 ImageNet-1K obtains 84.1% Top-1 accuracy), while being more scalable on high-resolution images. The source code and models are released at https://github.com/NVIDIA/transformer-ls.
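To make the combination described above more concrete, here is a minimal single-head, bidirectional sketch in NumPy. It is an illustration under stated assumptions, not the released implementation: the abstract does not specify how the dynamic projection is computed, so the sketch assumes a softmax over positions of a linear map of the keys; the dual normalization is approximated by separate, parameter-free layer norms on the local and compressed keys/values; and the names `long_short_attention`, `w_proj`, `window`, and `r` are hypothetical. Each query scores at most `2 * window + 1 + r` keys, so the total cost grows linearly with the sequence length `n`.

```python
import numpy as np

def softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    # Parameter-free layer norm over the feature axis (learned gain/bias omitted).
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def long_short_attention(q, k, v, w_proj, window):
    """Hypothetical sketch. q, k, v: (n, d); w_proj: (d, r); window: local radius."""
    n, d = q.shape
    # Long-range branch (assumed form): an input-dependent projection
    # compresses the n keys/values into r summary rows.
    p = softmax(k @ w_proj, axis=0)        # (n, r), normalized over positions
    k_long = layer_norm(p.T @ k)           # (r, d)
    v_long = layer_norm(p.T @ v)           # (r, d)
    # Short-term branch: keys/values normalized separately from the long-range
    # ones, so both branches enter the shared softmax at a comparable scale.
    k_local = layer_norm(k)
    v_local = layer_norm(v)
    out = np.empty_like(q)
    for i in range(n):
        lo, hi = max(0, i - window), min(n, i + window + 1)
        keys = np.concatenate([k_local[lo:hi], k_long])   # at most 2*window+1+r rows
        vals = np.concatenate([v_local[lo:hi], v_long])
        scores = keys @ q[i] / np.sqrt(d)
        out[i] = softmax(scores, axis=0) @ vals           # one softmax over both branches
    return out

# Toy usage with random inputs (all sizes are arbitrary illustration values).
rng = np.random.default_rng(0)
n, d, r = 64, 8, 4
q, k, v = (rng.standard_normal((n, d)) for _ in range(3))
w_proj = rng.standard_normal((d, r))
out = long_short_attention(q, k, v, w_proj, window=3)
print(out.shape)
```

The per-query Python loop is kept for readability; a practical version would batch the local windows, and an autoregressive variant would additionally need causal masking in both branches.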
