Abstract
Videos are a rich source of multi-modal supervision. In this work, we learn representations using self-supervision by leveraging three modalities naturally present in videos: vision, audio and language. To this end, we introduce the notion of a multimodal versatile network: a network that can ingest multiple modalities and whose representations enable downstream tasks in multiple modalities. In particular, we explore how best to combine the modalities, such that fine-grained representations of audio and vision can be maintained, whilst also integrating text into a common embedding. Driven by versatility, we also introduce a novel process of deflation, so that the networks can be effortlessly applied to visual data in the form of video or a static image. We demonstrate how such networks trained on large collections of unlabelled video data can be applied to video, video-text, image and audio tasks. Equipped with these representations, we obtain state-of-the-art performance on multiple challenging benchmarks including UCF101, HMDB51 and ESC-50 when compared to previous self-supervised work.