ResT: An Efficient Transformer for Visual Recognition

Abstract

This paper presents an efficient multi-scale vision Transformer, called ResT, that serves as a general-purpose backbone for image recognition. Unlike existing Transformer methods, which employ standard Transformer blocks to process raw images at a fixed resolution, ResT has several advantages: (1) a memory-efficient multi-head self-attention is built, which compresses the memory with a simple depth-wise convolution and projects the interaction across the attention-head dimension while preserving the diversity of the heads; (2) position encoding is constructed as spatial attention, which is more flexible and can handle input images of arbitrary size without interpolation or fine-tuning; (3) instead of straightforward tokenization at the beginning of each stage, the patch embedding is designed as a stack of overlapping strided convolutions applied to the 2D-reshaped token map. We comprehensively validate ResT on image classification and downstream tasks. Experimental results show that the proposed ResT can outperform recent state-of-the-art backbones by a large margin, demonstrating the potential of ResT as a strong backbone. The code and models will be made publicly available at https://github.com/wofmanaf/ResT.
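
To give a rough sense of why compressing the keys and values with a strided depth-wise convolution saves memory, the following back-of-the-envelope sketch counts attention-map entries with and without the reduction. It is not taken from the ResT code: the function names, the 56x56 token map, the stride of 8, and the kernel and padding choices are all illustrative assumptions.

```python
def conv_out_size(size: int, kernel: int, stride: int, padding: int) -> int:
    """Output length of a convolution along one spatial axis (standard formula)."""
    return (size + 2 * padding - kernel) // stride + 1


def attention_map_entries(height: int, width: int, heads: int, kv_stride: int = 1) -> int:
    """Count attention-map entries for one image.

    Queries keep all height * width tokens. Keys and values are assumed to be
    spatially reduced by a depth-wise convolution with stride `kv_stride`;
    kernel = kv_stride + 1 and padding = kv_stride // 2 are assumptions made
    for this sketch, not values confirmed by the abstract.
    """
    n_query = height * width
    if kv_stride == 1:
        n_kv = n_query  # standard self-attention: no reduction
    else:
        kernel = kv_stride + 1
        padding = kv_stride // 2
        n_kv = (conv_out_size(height, kernel, kv_stride, padding)
                * conv_out_size(width, kernel, kv_stride, padding))
    return heads * n_query * n_kv


# Hypothetical early-stage token map of 56 x 56 with a single head.
full = attention_map_entries(56, 56, heads=1)
reduced = attention_map_entries(56, 56, heads=1, kv_stride=8)
print(full, reduced, full // reduced)  # 9834496 153664 64
```

Under these assumptions the key/value map shrinks from 56x56 to 7x7 tokens, so the attention map is 64 times smaller while the number of query tokens is unchanged.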
