Abstract
Vision Transformers (ViTs) mark a revolutionary advance in neural networks, owing to the powerful global context modeling of their token mixers. However, pairwise token affinity and complex matrix operations limit their deployment in resource-constrained scenarios and real-time applications, such as on mobile devices, despite considerable efforts in previous works. In this paper, we introduce CAS-ViT: Convolutional Additive Self-attention Vision Transformers, to achieve a balance between efficiency and performance in mobile applications. First, we argue that the capability of token mixers to obtain global contextual information hinges on multiple information interactions, such as those in the spatial and channel domains. Based on this, we propose the Convolutional Additive Token Mixer (CATM), which employs underlying spatial and channel attention as novel interaction forms. This module eliminates cumbersome operations such as matrix multiplication and Softmax. We then introduce a hybrid architecture built from Convolutional Additive Self-attention (CAS) blocks, each of which uses CATM, and build a family of lightweight networks that can be easily extended to various downstream tasks. Finally, we evaluate CAS-ViT across a variety of vision tasks, including image classification, object detection, instance segmentation, and semantic segmentation. Our M and T models achieve 83.0\%/84.1\% top-1 accuracy on ImageNet-1K with only 12M/21M parameters. Throughput evaluations on GPUs, ONNX, and iPhones also demonstrate superior results compared to other state-of-the-art backbones. Extensive experiments demonstrate that our approach achieves a better balance of performance, inference efficiency, and ease of deployment. Our code and models are available at: \url{https://github.com/Tianfang-Zhang/CAS-ViT}
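
As a rough illustration of the additive idea summarized above, the following pure-Python sketch mixes tokens using only element-wise gates and a sum, with no token-by-token matrix product and no Softmax. It is a simplified stand-in and not the released CATM implementation: the function names `gate` and `additive_mix` are hypothetical, the query, key, and value inputs are assumed to be precomputed, and plain means followed by a sigmoid are assumed in place of the learned convolutional layers.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gate(tokens):
    """Apply a per-token (spatial) gate, then a per-channel gate.

    `tokens` is a list of N tokens, each a list of C floats.
    Assumption: means stand in for learned convolutions.
    """
    n, c = len(tokens), len(tokens[0])
    # Spatial gate: one scalar per token, from that token's channel mean.
    spatial = [sigmoid(sum(t) / c) for t in tokens]
    gated = [[s * x for x in t] for s, t in zip(spatial, tokens)]
    # Channel gate: one scalar per channel, from its mean over all tokens.
    channel = [sigmoid(sum(t[j] for t in gated) / n) for j in range(c)]
    return [[t[j] * channel[j] for j in range(c)] for t in gated]


def additive_mix(q, k, v):
    """Combine gated queries and keys by addition, then modulate the values.

    Cost grows linearly with the number of tokens, since no N x N
    affinity matrix is ever formed.
    """
    gq, gk = gate(q), gate(k)
    return [
        [(a + b) * w for a, b, w in zip(row_q, row_k, row_v)]
        for row_q, row_k, row_v in zip(gq, gk, v)
    ]


# Toy input: 4 tokens with 3 channels each, reused as q, k, and v.
tokens = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6], [0.7, 0.8, 0.9], [1.0, 1.1, 1.2]]
mixed = additive_mix(tokens, tokens, tokens)
```

Global context enters here only through the channel gate, which pools over all tokens; every other step is local to a single token, which is why the cost stays linear in the token count.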