BEVT: BERT Pretraining of Video Transformers

Abstract

This paper studies the BERT pretraining of video transformers. It is a straightforward but worth-studying extension given the recent success of BERT pretraining of image transformers. We introduce BEVT, which decouples video representation learning into spatial representation learning and temporal dynamics learning. In particular, BEVT first performs masked image modeling on image data, and then conducts masked image modeling jointly with masked video modeling on video data. This design is motivated by two observations: 1) transformers learned on image datasets provide decent spatial priors that can ease the learning of video transformers, which are often computationally intensive if trained from scratch; 2) discriminative clues, i.e., spatial and temporal information, needed to make correct predictions vary among different videos due to large intra-class and inter-class variations. We conduct extensive experiments on three challenging video benchmarks where BEVT achieves very promising results. On Kinetics 400, for which recognition mostly relies on discriminative spatial representations, BEVT achieves comparable results to strong supervised baselines. On Something-Something-V2 and Diving48, which contain videos relying on temporal dynamics, BEVT outperforms all alternative baselines by clear margins and achieves state-of-the-art performance with 70.6% and 86.7% Top-1 accuracy, respectively.
