Omnivore: A Single Model for Many Visual Modalities

Abstract

Prior work has studied different visual modalities in isolation and developed separate architectures for recognition of images, videos, and 3D data. Instead, in this paper, we propose a single model which excels at classifying images, videos, and single-view 3D data using exactly the same model parameters. Our 'Omnivore' model leverages the flexibility of transformer-based architectures and is trained jointly on classification tasks from different modalities. Omnivore is simple to train, uses off-the-shelf standard datasets, and performs at-par or better than modality-specific models of the same size. A single Omnivore model obtains 86.0% on ImageNet, 84.1% on Kinetics, and 67.1% on SUN RGB-D. After finetuning, our models outperform prior work on a variety of vision tasks and generalize across modalities. Omnivore's shared visual representation naturally enables cross-modal recognition without access to correspondences between modalities. We hope our results motivate researchers to model visual modalities together.
