VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training

Abstract

Pre-training video transformers on extra large-scale datasets is generally required to achieve premier performance on relatively small datasets. In this paper, we show that video masked autoencoders (VideoMAE) are data-efficient learners for self-supervised video pre-training (SSVP). We are inspired by the recent ImageMAE and propose customized video tube masking and reconstruction. These simple designs turn out to be effective for overcoming information leakage caused by temporal correlation during video reconstruction. We obtain three important findings on SSVP: (1) An extremely high masking ratio (i.e., 90% to 95%) still yields favorable performance for VideoMAE. The temporally redundant video content enables a higher masking ratio than that used for images. (2) VideoMAE achieves impressive results on very small datasets (i.e., around 3k-4k videos) without using any extra data. This is partially ascribed to the challenging task of video reconstruction, which enforces high-level structure learning. (3) VideoMAE shows that data quality is more important than data quantity for SSVP. Domain shift between pre-training and target datasets is an important issue in SSVP. Notably, our VideoMAE with the vanilla ViT backbone can achieve 83.9% on Kinetics-400, 75.3% on Something-Something V2, 90.8% on UCF101, and 61.1% on HMDB51 without using any extra data. Code will be released at https://github.com/MCG-NJU/VideoMAE.
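
The tube masking mentioned in the abstract can be illustrated with a minimal sketch. It assumes that a "tube" means one set of spatial patch positions is sampled once and hidden in every temporal slice, so a masked patch cannot be recovered by copying it from a neighboring frame. The function name `tube_mask`, the 8 x 14 x 14 token grid, and the use of NumPy are illustrative assumptions, not details taken from the abstract or the released code.

```python
import numpy as np

def tube_mask(num_frames, grid_h, grid_w, mask_ratio=0.9, rng=None):
    """Return a boolean array of shape (num_frames, grid_h * grid_w); True = masked.

    Hypothetical helper for illustration only.
    """
    rng = np.random.default_rng() if rng is None else rng
    num_patches = grid_h * grid_w
    num_masked = int(round(mask_ratio * num_patches))

    # Sample the masked spatial positions once ...
    spatial = np.zeros(num_patches, dtype=bool)
    masked_idx = rng.choice(num_patches, size=num_masked, replace=False)
    spatial[masked_idx] = True

    # ... and reuse them for every temporal slice, so each masked
    # position extends through time as a "tube".
    return np.tile(spatial, (num_frames, 1))

# Assumed token grid: 8 temporal slices of 14 x 14 patches, 90% masked.
mask = tube_mask(num_frames=8, grid_h=14, grid_w=14, mask_ratio=0.9,
                 rng=np.random.default_rng(0))
visible_per_slice = int((~mask[0]).sum())
```

Under these assumed sizes, 176 of the 196 spatial positions are hidden in every slice, leaving 20 visible patches per slice for the encoder.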
