Castling-ViT: Compressing Self-Attention via Switching Towards Linear-Angular Attention at Vision Transformer Inference

Abstract

Vision Transformers (ViTs) have shown impressive performance but still incur a high computation cost compared to convolutional neural networks (CNNs). One reason is that ViTs' attention measures global similarities and thus has quadratic complexity in the number of input tokens. Existing efficient ViTs adopt local attention (e.g., Swin) or linear attention (e.g., Performer), which sacrifice ViTs' capabilities of capturing either global or local context. In this work, we ask an important research question: Can ViTs learn both global and local context while being more efficient during inference? To this end, we propose a framework called Castling-ViT, which trains ViTs using both linear-angular attention and masked softmax-based quadratic attention, but then switches to having only linear-angular attention during ViT inference. Our Castling-ViT leverages angular kernels to measure the similarities between queries and keys via spectral angles. We further simplify it with two techniques: (1) a novel linear-angular attention mechanism: we decompose the angular kernels into linear terms and high-order residuals, and only keep the linear terms; and (2) we adopt two parameterized modules to approximate the high-order residuals: a depthwise convolution and an auxiliary masked softmax attention to help learn both global and local information, where the masks for softmax attention are regularized to gradually become zeros and thus incur no overhead during ViT inference. Extensive experiments and ablation studies on three tasks consistently validate the effectiveness of the proposed Castling-ViT, e.g., achieving up to 1.8% higher accuracy or a 40% MACs reduction on ImageNet classification and 1.2 higher mAP on COCO detection under comparable FLOPs, as compared to ViTs with vanilla softmax-based attention.
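
The linear-term idea in technique (1) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes, since the abstract gives no formula, that the angular kernel between unit-norm queries and keys is 1 - theta/pi with theta = arccos(q . k), so that the Taylor expansion of arccos leaves the linear part 1/2 + (q . k)/pi. The function names are hypothetical, and the output normalization, depthwise convolution, and masked softmax branch are all omitted.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    # Scale each row to unit length so dot products equal cosines of angles.
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def linear_angular_attention(q, k, v):
    # Assumed linear part of the angular kernel: sim = 1/2 + (q . k) / pi.
    # Computing K^T V first avoids forming the N x N similarity matrix,
    # so the cost grows linearly with the number of tokens N.
    q, k = l2_normalize(q), l2_normalize(k)
    const_term = 0.5 * v.sum(axis=0, keepdims=True)  # shape (1, d_v)
    kv = k.T @ v                                     # shape (d, d_v)
    return const_term + (q @ kv) / np.pi             # shape (N, d_v)

def quadratic_reference(q, k, v):
    # Same similarity, but with the N x N matrix built explicitly.
    q, k = l2_normalize(q), l2_normalize(k)
    sim = 0.5 + (q @ k.T) / np.pi                    # shape (N, N)
    return sim @ v

rng = np.random.default_rng(0)
q = rng.standard_normal((16, 8))   # 16 tokens, feature dimension 8
k = rng.standard_normal((16, 8))
v = rng.standard_normal((16, 4))

out_linear = linear_angular_attention(q, k, v)
out_quad = quadratic_reference(q, k, v)
outputs_match = bool(np.allclose(out_linear, out_quad))
```

Both functions compute the same result; the linear variant only reorders the matrix products, which is where the savings over quadratic attention would come from under these assumptions.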
