MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer

Abstract

Light-weight convolutional neural networks (CNNs) are the de facto standard for mobile vision tasks. Their spatial inductive biases allow them to learn representations with fewer parameters across different vision tasks. However, these networks are spatially local. To learn global representations, self-attention-based vision transformers (ViTs) have been adopted. Unlike CNNs, ViTs are heavy-weight. In this paper, we ask the following question: is it possible to combine the strengths of CNNs and ViTs to build a light-weight and low-latency network for mobile vision tasks? Towards this end, we introduce MobileViT, a light-weight and general-purpose vision transformer for mobile devices. MobileViT presents a different perspective for the global processing of information with transformers, i.e., transformers as convolutions. Our results show that MobileViT significantly outperforms CNN- and ViT-based networks across different tasks and datasets. On the ImageNet-1k dataset, MobileViT achieves top-1 accuracy of 78.4% with about 6 million parameters, which is 3.2% and 6.2% more accurate than MobileNetv3 (CNN-based) and DeIT (ViT-based) for a similar number of parameters. On the MS-COCO object detection task, MobileViT is 5.7% more accurate than MobileNetv3 for a similar number of parameters.
