Making Vision Transformers Truly Shift-Equivariant

Abstract

For computer vision tasks, Vision Transformers (ViTs) have become one of the go-to deep net architectures. Despite being inspired by Convolutional Neural Networks (CNNs), ViTs remain sensitive to small shifts in the input image. To address this, we introduce novel designs for each of the modules in ViTs, such as tokenization, self-attention, patch merging, and positional encoding. With our proposed modules, we achieve truly shift-equivariant ViTs on four well-established models, namely, Swin, SwinV2, MViTv2, and CvT, both in theory and practice. Empirically, we tested these models on image classification and semantic segmentation, achieving competitive performance across three different datasets while maintaining 100% shift consistency.
