EfficientViT: Memory Efficient Vision Transformer with Cascaded Group Attention

Abstract

Vision transformers have shown great success due to their high model capabilities. However, their remarkable performance is accompanied by heavy computation costs, which makes them unsuitable for real-time applications. In this paper, we propose a family of high-speed vision transformers named EfficientViT. We find that the speed of existing transformer models is commonly bounded by memory-inefficient operations, especially the tensor reshaping and element-wise functions in MHSA. Therefore, we design a new building block with a sandwich layout, i.e., using a single memory-bound MHSA between efficient FFN layers, which improves memory efficiency while enhancing channel communication. Moreover, we discover that the attention maps share high similarities across heads, leading to computational redundancy. To address this, we present a cascaded group attention module feeding attention heads with different splits of the full feature, which not only saves computation cost but also improves attention diversity. Comprehensive experiments demonstrate EfficientViT outperforms existing efficient models, striking a good trade-off between speed and accuracy. For instance, our EfficientViT-M5 surpasses MobileNetV3-Large by 1.9% in accuracy, while getting 40.4% and 45.2% higher throughput on Nvidia V100 GPU and Intel Xeon CPU, respectively. Compared to the recent efficient model MobileViT-XXS, EfficientViT-M2 achieves 1.8% superior accuracy, while running 5.8x/3.7x faster on the GPU/CPU, and 7.4x faster when converted to ONNX format. Code and models are available at https://github.com/microsoft/Cream/tree/main/EfficientViT.
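
The idea of feeding each attention head a different split of the full feature can be illustrated with a minimal NumPy sketch. It is not the released implementation: the function name `cascaded_group_attention`, the weight shapes, and the single-image, unbatched layout are all hypothetical. In particular, the cascade step (adding the previous head's output to the next head's input split) is an assumption about what "cascaded" means here; the abstract itself only states that heads receive different splits.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cascaded_group_attention(x, wq, wk, wv, wo):
    """Illustrative sketch only (hypothetical names and shapes).

    x:  (tokens, dim) input features
    wq, wk: lists of per-head matrices, each (dim // heads, key_dim)
    wv: list of per-head matrices, each (dim // heads, dim // heads)
    wo: (dim, dim) output projection
    """
    num_heads = len(wq)
    # Each head sees only its own channel split, not the full feature.
    splits = np.split(x, num_heads, axis=-1)
    outputs, prev = [], None
    for h in range(num_heads):
        # Assumed cascade: refine this head's split with the previous head's output.
        feat = splits[h] if prev is None else splits[h] + prev
        q, k, v = feat @ wq[h], feat @ wk[h], feat @ wv[h]
        attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
        prev = attn @ v
        outputs.append(prev)
    # Concatenate head outputs and project back to the full dimension.
    return np.concatenate(outputs, axis=-1) @ wo

rng = np.random.default_rng(0)
tokens, dim, heads, key_dim = 5, 8, 4, 3
head_dim = dim // heads
x = rng.standard_normal((tokens, dim))
wq = [rng.standard_normal((head_dim, key_dim)) for _ in range(heads)]
wk = [rng.standard_normal((head_dim, key_dim)) for _ in range(heads)]
wv = [rng.standard_normal((head_dim, head_dim)) for _ in range(heads)]
wo = rng.standard_normal((dim, dim))
out = cascaded_group_attention(x, wq, wk, wv, wo)
print(out.shape)
```

Because every head projects only `dim // heads` input channels instead of the full `dim`, the per-head Q/K/V projections are smaller than in standard MHSA, which is one way to read the claimed saving in computation cost.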