Multiscale Vision Transformers

  • 2021-04-22 17:59:45
  • Haoqi Fan, Bo Xiong, Karttikeya Mangalam, Yanghao Li, Zhicheng Yan, Jitendra Malik, Christoph Feichtenhofer
  • 24


We present Multiscale Vision Transformers (MViT) for video and image recognition, by connecting the seminal idea of multiscale feature hierarchies with transformer models. Multiscale Transformers have several channel-resolution scale stages. Starting from the input resolution and a small channel dimension, the stages hierarchically expand the channel capacity while reducing the spatial resolution. This creates a multiscale pyramid of features with early layers operating at high spatial resolution to model simple low-level visual information, and deeper layers at spatially coarse, but complex, high-dimensional features. We evaluate this fundamental architectural prior for modeling the dense nature of visual signals for a variety of video recognition tasks where it outperforms concurrent vision transformers that rely on large scale external pre-training and are 5-10x more costly in computation and parameters. We further remove the temporal dimension and apply our model for image classification where it outperforms prior work on vision transformers. Code is available at:
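
To make the channel-resolution scale stages concrete, the following is a minimal sketch of how such a stage schedule could be tabulated. It is not the authors' implementation: the function name `build_stage_schedule`, the `Stage` record, the factor-of-two channel expansion, the factor-of-two spatial reduction, and the example sizes (a 56x56 token grid, 96 starting channels, 4 stages) are all illustrative assumptions that the abstract does not state.

```python
from dataclasses import dataclass


@dataclass
class Stage:
    index: int      # 1-based stage number
    channels: int   # channel dimension used in this stage
    height: int     # spatial height of the token grid
    width: int      # spatial width of the token grid

    @property
    def tokens(self) -> int:
        # Number of spatial positions the stage attends over.
        return self.height * self.width


def build_stage_schedule(height, width, base_channels, num_stages,
                         channel_mult=2, spatial_stride=2):
    """Tabulate a hypothetical multiscale schedule.

    Each stage multiplies the channel dimension by `channel_mult` and
    divides each spatial side by `spatial_stride`. Both factors are
    assumptions chosen for illustration, not values from the abstract.
    """
    stages = []
    c, h, w = base_channels, height, width
    for i in range(num_stages):
        stages.append(Stage(i + 1, c, h, w))
        c *= channel_mult
        h = max(1, h // spatial_stride)
        w = max(1, w // spatial_stride)
    return stages


# Illustrative sizes only (assumed, not taken from the abstract).
schedule = build_stage_schedule(height=56, width=56,
                                base_channels=96, num_stages=4)
for s in schedule:
    print(f"stage {s.index}: channels={s.channels}, "
          f"grid={s.height}x{s.width}, tokens={s.tokens}")
```

Under these assumed factors, early stages hold many tokens with few channels and later stages hold few tokens with many channels, which is the pyramid shape the abstract describes.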

