Abstract
In this paper, we study Multiscale Vision Transformers (MViT) as a unified architecture for image and video classification, as well as object detection. We present an improved version of MViT that incorporates decomposed relative positional embeddings and residual pooling connections. We instantiate this architecture in five sizes and evaluate it for ImageNet classification, COCO detection and Kinetics video recognition where it outperforms prior work. We further compare MViTs' pooling attention to window attention mechanisms where it outperforms the latter in accuracy/compute. Without bells-and-whistles, MViT has state-of-the-art performance in 3 domains: 88.8% accuracy on ImageNet classification, 56.1 box AP on COCO object detection as well as 86.1% on Kinetics-400 video classification. Code and models will be made publicly available.