Rethinking and Improving Relative Position Encoding for Vision Transformer

Abstract

Relative position encoding (RPE) is important for transformers to capture the sequence ordering of input tokens. Its general efficacy has been proven in natural language processing. In computer vision, however, its efficacy is not well studied and remains controversial: for example, can relative position encoding work as well as absolute position encoding? To clarify this, we first review existing relative position encoding methods and analyze their pros and cons when applied in vision transformers. We then propose new relative position encoding methods dedicated to 2D images, called image RPE (iRPE). Our methods consider directional relative distance modeling as well as the interactions between queries and relative position embeddings in the self-attention mechanism. The proposed iRPE methods are simple and lightweight, and they can be easily plugged into transformer blocks. Experiments demonstrate that, solely due to the proposed encoding methods, DeiT and DETR obtain stable improvements of up to 1.5% (top-1 Acc) and 1.3% (mAP) over their original versions on ImageNet and COCO respectively, without tuning any extra hyperparameters such as learning rate and weight decay. Our ablation and analysis also yield interesting findings, some of which run counter to previous understanding. Code and models are open-sourced at https://github.com/microsoft/Cream/tree/main/iRPE.
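To make the two ingredients named in the abstract concrete (directional relative distances on a 2D grid, and an interaction between queries and relative position embeddings), the following is a minimal NumPy sketch. It is not the released iRPE code: the function names `relative_index_2d` and `attention_with_rpe`, the simple clipping of offsets to `max_dist`, and the single-head layout are all assumptions made for illustration.

```python
import numpy as np


def relative_index_2d(height, width, max_dist):
    """Map every (query, key) pair on a height x width grid to a bucket id.

    The id depends on the signed row and column offsets, so "left of" and
    "right of" land in different buckets (directional). Clipping offsets to
    [-max_dist, max_dist] is an illustrative assumption.
    """
    ys, xs = np.meshgrid(np.arange(height), np.arange(width), indexing="ij")
    coords = np.stack([ys.ravel(), xs.ravel()], axis=1)       # (N, 2)
    delta = coords[None, :, :] - coords[:, None, :]           # (N, N, 2): key minus query
    delta = np.clip(delta, -max_dist, max_dist) + max_dist    # shift into [0, 2 * max_dist]
    side = 2 * max_dist + 1
    return delta[..., 0] * side + delta[..., 1]               # (N, N) integer bucket ids


def attention_with_rpe(q, k, v, rel_emb, rel_index):
    """Single-head attention with a query-dependent relative position term.

    q, k, v: (N, d); rel_emb: (num_buckets, d); rel_index: (N, N).
    """
    d = q.shape[-1]
    content = q @ k.T                                         # (N, N) content scores
    q_rel = q @ rel_emb.T                                     # (N, num_buckets)
    position = np.take_along_axis(q_rel, rel_index, axis=1)   # (N, N): q_i . r_{bucket(i, j)}
    logits = (content + position) / np.sqrt(d)
    logits = logits - logits.max(axis=-1, keepdims=True)      # numerically stable softmax
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights


# Hypothetical usage on a 3 x 4 grid of tokens.
rng = np.random.default_rng(0)
H, W, d, max_dist = 3, 4, 8, 2
N = H * W
rel_index = relative_index_2d(H, W, max_dist)
rel_emb = rng.normal(size=((2 * max_dist + 1) ** 2, d))
q, k, v = (rng.normal(size=(N, d)) for _ in range(3))
out, weights = attention_with_rpe(q, k, v, rel_emb, rel_index)
```

In this sketch the position term is computed from the query, so the same relative offset can contribute differently for different query tokens; replacing `position` with a learned scalar per bucket would give a query-independent bias instead.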
