Omnivore: A Single Model for Many Visual Modalities

  • 2022-01-20 18:58:03
  • Rohit Girdhar, Mannat Singh, Nikhila Ravi, Laurens van der Maaten, Armand Joulin, Ishan Misra
  • 32

Abstract

Prior work has studied different visual modalities in isolation and developed separate architectures for recognition of images, videos, and 3D data. Instead, in this paper, we propose a single model which excels at classifying images, videos, and single-view 3D data using exactly the same model parameters. Our 'Omnivore' model leverages the flexibility of transformer-based architectures and is trained jointly on classification tasks from different modalities. Omnivore is simple to train, uses off-the-shelf standard datasets, and performs at-par or better than modality-specific models of the same size. A single Omnivore model obtains 86.0% on ImageNet, 84.1% on Kinetics, and 67.1% on SUN RGB-D. After finetuning, our models outperform prior work on a variety of vision tasks and generalize across modalities. Omnivore's shared visual representation naturally enables cross-modal recognition without access to correspondences between modalities. We hope our results motivate researchers to model visual modalities together.

 
