Abstract
Most successful self-supervised learning methods are trained to align the representations of two independent views from the data. State-of-the-art methods in video are inspired by image techniques, where these two views are similarly extracted by cropping and augmenting the resulting crop. However, these methods miss a crucial element in the video domain: time. We introduce BraVe, a self-supervised learning framework for video. In BraVe, one of the views has access to a narrow temporal window of the video while the other view has broad access to the video content. Our models learn to generalise from the narrow view to the general content of the video. Furthermore, BraVe processes the views with different backbones, enabling the use of alternative augmentations or modalities in the broad view, such as optical flow, randomly convolved RGB frames, audio or their combinations. We demonstrate that BraVe achieves state-of-the-art results in self-supervised representation learning on standard video and audio classification benchmarks including UCF101, HMDB51, Kinetics, ESC-50 and AudioSet.
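
To make the narrow and broad views concrete, the following is a minimal sketch of how two temporal windows of different extent might be drawn from one video and compared. It is an illustration under stated assumptions, not the implementation used in this work: the function names `sample_views` and `regression_loss`, the window lengths, the uniform sampling of start positions, the random linear maps standing in for the two backbones, and the normalised squared-error loss are all hypothetical choices made for the example.

```python
import numpy as np


def sample_views(num_frames, narrow_len, broad_len, rng):
    """Return frame indices for a narrow and a broad temporal window.

    Both windows are drawn independently from the same video. Uniform
    sampling of the start positions is an illustrative assumption.
    """
    if not (0 < narrow_len <= broad_len <= num_frames):
        raise ValueError("expected 0 < narrow_len <= broad_len <= num_frames")
    narrow_start = int(rng.integers(0, num_frames - narrow_len + 1))
    broad_start = int(rng.integers(0, num_frames - broad_len + 1))
    narrow_idx = np.arange(narrow_start, narrow_start + narrow_len)
    broad_idx = np.arange(broad_start, broad_start + broad_len)
    return narrow_idx, broad_idx


def regression_loss(pred, target):
    """Squared distance between L2-normalised vectors (assumed loss form)."""
    pred = pred / np.linalg.norm(pred)
    target = target / np.linalg.norm(target)
    return float(np.sum((pred - target) ** 2))


rng = np.random.default_rng(0)
video = rng.standard_normal((300, 8, 8, 3))  # toy video: 300 frames of 8x8 RGB

narrow_idx, broad_idx = sample_views(len(video), narrow_len=16, broad_len=128, rng=rng)
narrow_view, broad_view = video[narrow_idx], video[broad_idx]

# Hypothetical stand-ins for the two separate backbones: different random
# linear maps applied to the time-averaged, flattened frames of each view.
w_narrow = rng.standard_normal((8 * 8 * 3, 32))
w_broad = rng.standard_normal((8 * 8 * 3, 32))
z_narrow = narrow_view.mean(axis=0).reshape(-1) @ w_narrow
z_broad = broad_view.mean(axis=0).reshape(-1) @ w_broad

# The narrow-view embedding is compared against the broad-view embedding.
loss = regression_loss(z_narrow, z_broad)
```

Because each view has its own backbone in this sketch, the broad view could be replaced by another input, such as optical flow or audio features, without changing the narrow branch.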