Abstract
We present Multiscale Vision Transformers (MViT) for video and image recognition, by connecting the seminal idea of multiscale feature hierarchies with transformer models. Multiscale Transformers have several channel-resolution scale stages. Starting from the input resolution and a small channel dimension, the stages hierarchically expand the channel capacity while reducing the spatial resolution. This creates a multiscale pyramid of features with early layers operating at high spatial resolution to model simple low-level visual information, and deeper layers at spatially coarse, but complex, high-dimensional features. We evaluate this fundamental architectural prior for modeling the dense nature of visual signals for a variety of video recognition tasks where it outperforms concurrent vision transformers that rely on large scale external pre-training and are 5-10x more costly in computation and parameters. We further remove the temporal dimension and apply our model for image classification where it outperforms prior work on vision transformers. Code is available at: https://github.com/facebookresearch/SlowFast
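To make the channel-resolution trade-off concrete, the following is a minimal sketch of a stage schedule in which the channel dimension grows as the spatial resolution shrinks. It is illustrative only: the function name `multiscale_schedule`, the `Stage` record, the factor of 2 for both channel expansion and spatial reduction, and the example input values are all assumptions made here, not settings stated in the abstract or taken from the released code.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Stage:
    """Shape of the feature map entering one scale stage (hypothetical record)."""
    index: int
    channels: int
    height: int
    width: int


def multiscale_schedule(
    height: int,
    width: int,
    base_channels: int,
    num_stages: int,
    channel_mult: int = 2,    # assumed expansion factor per stage
    spatial_stride: int = 2,  # assumed downsampling factor per stage
) -> List[Stage]:
    """Return per-stage shapes: channels expand while resolution shrinks."""
    stages = []
    c, h, w = base_channels, height, width
    for i in range(num_stages):
        stages.append(Stage(i, c, h, w))
        # Move to the next stage: more channels, coarser spatial grid.
        c *= channel_mult
        h = max(1, h // spatial_stride)
        w = max(1, w // spatial_stride)
    return stages


if __name__ == "__main__":
    # Example values (assumed): a 56x56 token grid with 96 starting channels.
    for s in multiscale_schedule(56, 56, base_channels=96, num_stages=4):
        print(f"stage {s.index}: {s.channels} channels at {s.height}x{s.width}")
```

Under these assumed factors, each stage doubles the channel dimension and quarters the number of spatial positions, so early stages process many low-dimensional tokens and later stages process few high-dimensional ones.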