Long-Short Transformer: Efficient Transformers for Language and Vision

  • 2021-07-27 19:34:30
  • Chen Zhu, Wei Ping, Chaowei Xiao, Mohammad Shoeybi, Tom Goldstein, Anima Anandkumar, Bryan Catanzaro

Abstract

Transformers have achieved success in both language and vision domains. However, it is prohibitively expensive to scale them to long sequences such as long documents or high-resolution images, because the self-attention mechanism has quadratic time and memory complexity with respect to the input sequence length. In this paper, we propose Long-Short Transformer (Transformer-LS), an efficient self-attention mechanism for modeling long sequences with linear complexity for both language and vision tasks. It aggregates a novel long-range attention with dynamic projection to model distant correlations and a short-term attention to capture fine-grained local correlations. We propose a dual normalization strategy to account for the scale mismatch between the two attention mechanisms. Transformer-LS can be applied to both autoregressive and bidirectional models without additional complexity. Our method outperforms the state-of-the-art models on multiple tasks in language and vision domains, including the Long Range Arena benchmark, autoregressive language modeling, and ImageNet classification. For instance, Transformer-LS achieves 0.97 test BPC on enwik8 using half the number of parameters of the previous method, while being faster and able to handle sequences 3x as long as its full-attention version on the same hardware. On ImageNet, it obtains state-of-the-art results (e.g., a moderate-size model with 55.8M parameters trained solely on 224x224 ImageNet-1K obtains 84.1% Top-1 accuracy), while being more scalable on high-resolution images. The source code and models are released at https://github.com/NVIDIA/transformer-ls.
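
The abstract names three ingredients: a long-range attention with dynamic projection, a short-term (local) attention, and a dual normalization that puts the two on a comparable scale. The NumPy sketch below illustrates one way these pieces could fit together for a single bidirectional head. It is not the released implementation. The function name `long_short_attention`, the projection matrix `W_p`, the softmax over the sequence axis, the parameter-free layer normalization, and the per-token Python loop are all assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    # Parameter-free layer norm over the feature axis (assumption: no learned scale or bias).
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def long_short_attention(Q, K, V, W_p, window):
    """Hypothetical single-head sketch. Q, K, V: (n, d); W_p: (d, r)."""
    n, d = Q.shape

    # Long-range branch: the projection depends on the input K ("dynamic"),
    # and compresses n tokens into r summary keys and values.
    P = softmax(K @ W_p, axis=0)       # (n, r), weights normalized over the sequence
    K_long = layer_norm(P.T @ K)       # (r, d)
    V_long = layer_norm(P.T @ V)       # (r, d)

    out = np.empty_like(Q)
    for i in range(n):
        # Short-term branch: tokens within `window` positions of token i.
        lo, hi = max(0, i - window), min(n, i + window + 1)
        K_short = layer_norm(K[lo:hi])
        V_short = layer_norm(V[lo:hi])

        # Both branches are normalized separately (the "dual" normalization),
        # then attended to jointly with one softmax.
        K_cat = np.concatenate([K_short, K_long], axis=0)
        V_cat = np.concatenate([V_short, V_long], axis=0)
        weights = softmax(Q[i] @ K_cat.T / np.sqrt(d))
        out[i] = weights @ V_cat
    return out

# Usage with random inputs (shapes chosen arbitrarily for the example).
rng = np.random.default_rng(0)
n, d, r = 16, 8, 4
Q, K, V = (rng.normal(size=(n, d)) for _ in range(3))
W_p = rng.normal(size=(d, r))
out = long_short_attention(Q, K, V, W_p, window=2)
```

Under these assumptions each query attends to at most `2 * window + 1 + r` keys regardless of `n`, so the total cost grows linearly with sequence length rather than quadratically.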

 
