Swin Transformer: Hierarchical Vision Transformer using Shifted Windows

Abstract

This paper presents a new vision Transformer, called Swin Transformer, that capably serves as a general-purpose backbone for computer vision. Challenges in adapting Transformer from language to vision arise from differences between the two domains, such as large variations in the scale of visual entities and the high resolution of pixels in images compared to words in text. To address these differences, we propose a hierarchical Transformer whose representation is computed with shifted windows. The shifted windowing scheme brings greater efficiency by limiting self-attention computation to non-overlapping local windows while also allowing for cross-window connection. This hierarchical architecture has the flexibility to model at various scales and has linear computational complexity with respect to image size. These qualities of Swin Transformer make it compatible with a broad range of vision tasks, including image classification (86.4 top-1 accuracy on ImageNet-1K) and dense prediction tasks such as object detection (58.7 box AP and 51.1 mask AP on COCO test-dev) and semantic segmentation (53.5 mIoU on ADE20K val). Its performance surpasses the previous state-of-the-art by a large margin of +2.7 box AP and +2.6 mask AP on COCO, and +3.2 mIoU on ADE20K, demonstrating the potential of Transformer-based models as vision backbones. The code and models will be made publicly available at https://github.com/microsoft/Swin-Transformer.
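
The windowing idea in the abstract can be illustrated with a minimal NumPy sketch. It is not the released implementation: the function names, the window size of 4, the half-window shift, and the use of a cyclic roll to realize the shift are all assumptions made for illustration. The sketch shows how a feature map splits into non-overlapping windows, how a shifted partition makes a window straddle the previous window boundaries, and why counting query-key pairs per window grows linearly with the number of pixels when the window size is fixed.

```python
import numpy as np


def window_partition(x, window):
    """Split an (H, W, C) map into non-overlapping (window, window, C) tiles."""
    H, W, C = x.shape
    assert H % window == 0 and W % window == 0, "sketch assumes divisible sizes"
    x = x.reshape(H // window, window, W // window, window, C)
    # Bring the two tile-index axes together, then flatten them.
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, window, window, C)


def shifted_window_partition(x, window, shift):
    """Partition after displacing the grid by `shift` pixels.

    Assumption: the displacement is realized as a cyclic roll of the map.
    """
    rolled = np.roll(x, (-shift, -shift), axis=(0, 1))
    return window_partition(rolled, window)


def attention_pairs(H, W, window=None):
    """Count query-key pairs: global attention vs. attention inside windows."""
    n = H * W
    if window is None:
        return n * n                  # every pixel attends to every pixel
    return n * window * window        # every pixel attends to its own window


# Hypothetical toy input: an 8x8 single-channel map, window 4, shift 2.
x = np.arange(8 * 8).reshape(8, 8, 1)
regular = window_partition(x, 4)
shifted = shifted_window_partition(x, 4, 2)
```

Under these assumptions, the first shifted window covers rows and columns 2 to 5 of the original map, so it spans all four regular windows, which is one way to obtain cross-window connection. Doubling each side of the map multiplies `attention_pairs` by 16 for global attention but only by 4 for windowed attention, matching the linear-complexity claim.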
