Abstract
We present techniques for scaling Swin Transformer up to 3 billion parameters and making it capable of training with images of up to 1,536$\times$1,536 resolution. By scaling up capacity and resolution, Swin Transformer sets new records on four representative vision benchmarks: 84.0% top-1 accuracy on ImageNet-V2 image classification, 63.1/54.4 box/mask mAP on COCO object detection, 59.9 mIoU on ADE20K semantic segmentation, and 86.8% top-1 accuracy on Kinetics-400 video action classification. Our techniques are generally applicable for scaling up vision models, which has not been explored as widely as the scaling of NLP language models, partly due to the following difficulties in training and application: 1) vision models often face instability issues at scale, and 2) many downstream vision tasks require high-resolution images or windows, and it is not clear how to effectively transfer models pre-trained at low resolutions to higher-resolution ones. GPU memory consumption is also a problem when the image resolution is high. To address these issues, we present several techniques, illustrated using Swin Transformer as a case study: 1) a post-normalization technique and a scaled cosine attention approach to improve the stability of large vision models; 2) a log-spaced continuous position bias technique to effectively transfer models pre-trained on low-resolution images and windows to their higher-resolution counterparts. In addition, we share crucial implementation details that lead to significant savings in GPU memory consumption and thus make it feasible to train large vision models with regular GPUs. Using these techniques and self-supervised pre-training, we successfully train a strong 3B Swin Transformer model and effectively transfer it to various vision tasks involving high-resolution images or windows, achieving state-of-the-art accuracy on a variety of benchmarks.
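As an illustration of two of the techniques named above, the following is a minimal sketch rather than the implementation used for the reported results. The abstract does not state the formulas, so the forms here are assumptions: the attention logit is taken to be the cosine similarity of query and key divided by a temperature `tau`, plus a position bias, and relative offsets are mapped to log space as sign(d) * log(1 + |d|). The function names, the fixed `tau=0.1`, and the scalar `bias` are hypothetical choices made only for this example.

```python
import math


def log_spaced_coord(delta):
    """Map a signed relative offset to log space: sign(d) * log(1 + |d|).

    Assumed form for illustration; it compresses large offsets so that a
    larger window needs less extrapolation beyond the pre-training range.
    """
    return math.copysign(math.log1p(abs(delta)), delta)


def cosine_similarity(q, k, eps=1e-12):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(q, k))
    norm_q = math.sqrt(sum(a * a for a in q))
    norm_k = math.sqrt(sum(b * b for b in k))
    return dot / max(norm_q * norm_k, eps)


def scaled_cosine_logit(q, k, tau=0.1, bias=0.0):
    """Attention logit cos(q, k) / tau + bias.

    tau is a fixed hypothetical temperature here; in a trained model it
    would typically be a learned parameter. The logit is bounded by
    1 / tau + bias regardless of how large the activations grow.
    """
    return cosine_similarity(q, k) / tau + bias


# Growing the window from 8 to 16 extends the offset range from 7 to 15.
linear_ratio = 15 / 7
log_ratio = log_spaced_coord(15) / log_spaced_coord(7)

# Rescaling the query leaves the cosine logit (numerically) unchanged.
small = scaled_cosine_logit([1.0, 2.0], [3.0, 4.0])
large = scaled_cosine_logit([100.0, 200.0], [3.0, 4.0])
```

Under these assumptions, the largest offset grows by a factor of 15/7 in linear coordinates but only by 4/3 in log-spaced coordinates when the window doubles from 8 to 16, which is the intuition behind transferring position biases across window sizes. Likewise, because the cosine logit ignores vector magnitude, large activation values in deep layers cannot inflate the attention logits.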