OmniVL: One Foundation Model for Image-Language and Video-Language Tasks

Abstract

This paper presents OmniVL, a new foundation model to support both image-language and video-language tasks using one universal architecture. It adopts a unified transformer-based visual encoder for both image and video inputs, and thus can perform joint image-language and video-language pretraining. We demonstrate, for the first time, that such a paradigm benefits both image and video tasks, as opposed to the conventional one-directional transfer (e.g., using image-language to help video-language). To this end, we propose a decoupled joint pretraining of image-language and video-language to effectively decompose the vision-language modeling into spatial and temporal dimensions and obtain a performance boost on both image and video tasks. Moreover, we introduce a novel unified vision-language contrastive (UniVLC) loss to leverage image-text, video-text, image-label (e.g., image classification), and video-label (e.g., video action recognition) data together, so that both supervised and noisily supervised pretraining data are utilized as much as possible. Without incurring extra task-specific adaptors, OmniVL can simultaneously support visual-only tasks (e.g., image classification, video action recognition), cross-modal alignment tasks (e.g., image/video-text retrieval), and multi-modal understanding and generation tasks (e.g., image/video question answering, captioning). We evaluate OmniVL on a wide range of downstream tasks and achieve state-of-the-art or competitive results with similar model size and data scale.
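
The abstract does not give the form of the UniVLC loss, so the following is only an illustrative sketch of one common way to combine labeled and paired data in a single contrastive objective: samples that share a label are treated as mutual positives, and each web-crawled image-text or video-text pair receives its own unique label. The function name `unified_contrastive_loss`, the `temperature` default, and the positive-mask construction are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np


def log_softmax(x, axis):
    """Numerically stable log-softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))


def unified_contrastive_loss(visual, text, labels, temperature=0.07):
    """Label-aware, bidirectional contrastive loss (hypothetical sketch).

    visual, text: (N, D) embeddings of images/videos and their texts.
    labels: (N,) integer ids. Samples with equal ids are positives.
    Giving every pair a unique id reduces this to a standard
    symmetric InfoNCE loss over matched pairs.
    """
    # L2-normalize so the dot product is a cosine similarity.
    v = visual / np.linalg.norm(visual, axis=1, keepdims=True)
    t = text / np.linalg.norm(text, axis=1, keepdims=True)
    logits = v @ t.T / temperature  # (N, N) similarity matrix

    # pos[i, j] = 1 when sample i and sample j share a label.
    pos = (labels[:, None] == labels[None, :]).astype(float)

    # Visual-to-text: average log-likelihood over each row's positives.
    v2t = -(pos * log_softmax(logits, axis=1)).sum(axis=1) / pos.sum(axis=1)
    # Text-to-visual: the same over each column's positives.
    t2v = -(pos * log_softmax(logits, axis=0)).sum(axis=0) / pos.sum(axis=0)
    return 0.5 * (v2t.mean() + t2v.mean())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    vis = rng.normal(size=(6, 16))
    txt = rng.normal(size=(6, 16))
    # Ids 0 and 1 stand in for class labels; 2 and 3 for unique web pairs.
    ids = np.array([0, 0, 1, 1, 2, 3])
    print(unified_contrastive_loss(vis, txt, ids))
```

Under this assumed formulation, label data (e.g., class names rendered as text) and noisy caption data can share one mini-batch and one loss; the only difference between the two sources is how the `labels` ids are assigned.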
