MaxViT: Multi-Axis Vision Transformer

Abstract

Transformers have recently gained significant attention in the computer vision community. However, the lack of scalability of self-attention mechanisms with respect to image size has limited their wide adoption in state-of-the-art vision backbones. In this paper, we introduce an efficient and scalable attention model we call multi-axis attention, which consists of two aspects: blocked local and dilated global attention. These design choices allow global-local spatial interactions on arbitrary input resolutions with only linear complexity. We also present a new architectural element by effectively blending our proposed attention model with convolutions, and accordingly propose a simple hierarchical vision backbone, dubbed MaxViT, by simply repeating the basic building block over multiple stages. Notably, MaxViT is able to "see" globally throughout the entire network, even in earlier, high-resolution stages. We demonstrate the effectiveness of our model on a broad spectrum of vision tasks. On image classification, MaxViT achieves state-of-the-art performance under various settings: without extra data, MaxViT attains 86.5% ImageNet-1K top-1 accuracy; with ImageNet-21K pre-training, our model achieves 88.7% top-1 accuracy. For downstream tasks, MaxViT as a backbone delivers favorable performance on object detection as well as visual aesthetic assessment. We also show that our proposed model expresses strong generative modeling capability on ImageNet, demonstrating the superior potential of MaxViT blocks as a universal vision module. The source code and trained models will be available at https://github.com/google-research/maxvit.
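To make the linear-complexity claim concrete, the following is a minimal sketch of how blocked local and dilated global attention could group pixel positions. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the partitioning, so the fixed window size `block`, the fixed grid size `grid`, and the helper names `block_partition`, `grid_partition`, and `attention_pairs` are all hypothetical. The point it shows is that when every attention group has a fixed size, the number of query-key pairs grows linearly with the number of pixels, whereas full self-attention grows quadratically.

```python
def block_partition(height, width, block):
    """Assumed local scheme: non-overlapping block x block windows of adjacent pixels."""
    assert height % block == 0 and width % block == 0
    groups = []
    for top in range(0, height, block):
        for left in range(0, width, block):
            groups.append([(y, x)
                           for y in range(top, top + block)
                           for x in range(left, left + block)])
    return groups


def grid_partition(height, width, grid):
    """Assumed global scheme: each group is a grid x grid set of pixels
    sampled with a uniform stride (dilation) across the whole feature map."""
    assert height % grid == 0 and width % grid == 0
    stride_y, stride_x = height // grid, width // grid
    groups = []
    for off_y in range(stride_y):
        for off_x in range(stride_x):
            groups.append([(off_y + i * stride_y, off_x + j * stride_x)
                           for i in range(grid)
                           for j in range(grid)])
    return groups


def attention_pairs(groups):
    """Count query-key pairs when attention is computed independently per group."""
    return sum(len(group) ** 2 for group in groups)


for side in (8, 16):
    local_pairs = attention_pairs(block_partition(side, side, 4))
    global_pairs = attention_pairs(grid_partition(side, side, 4))
    full_pairs = (side * side) ** 2  # every pixel attends to every pixel
    print(side, local_pairs, global_pairs, full_pairs)
# prints:
# 8 1024 1024 4096
# 16 4096 4096 65536
```

In this sketch, quadrupling the pixel count (8x8 to 16x16) quadruples the pair count for both grouped schemes but multiplies the full-attention pair count by 16. The strided groups span the whole feature map at any resolution, which is one way to read the claim that the model can "see" globally even in high-resolution stages.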
