iBoot: Image-bootstrapped Self-Supervised Video Representation Learning

Abstract

Learning visual representations through self-supervision is an extremely challenging task, as the network needs to sieve relevant patterns from spurious distractors without the active guidance provided by supervision. This is achieved through heavy data augmentation, large-scale datasets and prohibitive amounts of compute. Video self-supervised learning (SSL) suffers from added challenges: video datasets are typically not as large as image datasets, compute is an order of magnitude larger, and the number of spurious patterns the optimizer has to sieve through is multiplied several fold. Thus, directly learning self-supervised representations from video data might result in sub-optimal performance. To address this, we propose to utilize a strong image-based model, pre-trained with self- or language supervision, in a video representation learning framework, enabling the model to learn strong spatial and temporal information without relying on labeled video data. To this end, we modify the typical video-based SSL design and objective to encourage the video encoder to \textit{subsume} the semantic content of an image-based model trained on a general domain. The proposed algorithm is shown to learn much more efficiently (i.e. in fewer epochs and with a smaller batch) and achieves new state-of-the-art performance on standard downstream tasks among single-modality SSL methods.
