Less is More: Pay Less Attention in Vision Transformers

Abstract

Transformers have become one of the dominant architectures in deep learning, particularly as a powerful alternative to convolutional neural networks (CNNs) in computer vision. However, Transformer training and inference in previous works can be prohibitively expensive due to the quadratic complexity of self-attention over a long sequence of representations, especially for high-resolution dense prediction tasks. To this end, we present a novel Less attention vIsion Transformer (LIT), building upon the fact that convolutions, fully-connected (FC) layers, and self-attentions have almost equivalent mathematical expressions for processing image patch sequences. Specifically, we propose a hierarchical Transformer where we use pure multi-layer perceptrons (MLPs) to encode rich local patterns in the early stages while applying self-attention modules to capture longer dependencies in deeper layers. Moreover, we further propose a learned deformable token merging module to adaptively fuse informative patches in a non-uniform manner. The proposed LIT achieves promising performance on image recognition tasks, including image classification, object detection and instance segmentation, serving as a strong backbone for many vision tasks.
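
The cost argument in the abstract can be illustrated with a rough operation count. The sketch below compares approximate multiply-accumulate (MAC) counts for one self-attention layer and one per-token MLP as the number of patch tokens grows. The formulas are the commonly used approximations (projections plus token mixing for attention, two linear layers for the MLP) and are not taken from the paper itself; the function names, the channel width of 64, the expansion ratio of 4, and the feature-map sizes are all assumptions chosen for illustration.

```python
def attention_macs(num_tokens: int, dim: int) -> int:
    """Approximate MACs for one self-attention layer (softmax and biases ignored).

    Q, K, V and output projections: 4 * N * D^2
    Attention scores (Q @ K^T) and weighted sum (A @ V): 2 * N^2 * D
    """
    projections = 4 * num_tokens * dim * dim
    token_mixing = 2 * num_tokens * num_tokens * dim
    return projections + token_mixing


def mlp_macs(num_tokens: int, dim: int, expansion: int = 4) -> int:
    """Approximate MACs for a two-layer MLP applied independently to each token."""
    return 2 * expansion * num_tokens * dim * dim


if __name__ == "__main__":
    dim = 64  # assumed channel width, for illustration only
    for side in (14, 28, 56):  # assumed feature-map side lengths
        n = side * side  # number of patch tokens
        print(f"{side}x{side}: attention={attention_macs(n, dim)}, mlp={mlp_macs(n, dim)}")
```

Under these assumptions, the MLP cost grows linearly with the number of tokens while the attention cost has a term that grows quadratically, which is consistent with the motivation for using MLPs in the early, high-resolution stages and reserving self-attention for the deeper, lower-resolution ones.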
