ViC-MAE: Self-Supervised Representation Learning from Images and Video with Contrastive Masked Autoencoders

Abstract

We propose ViC-MAE, a model that combines Masked AutoEncoders (MAE) and contrastive learning. ViC-MAE is trained using a global feature obtained by pooling the local representations learned under an MAE reconstruction loss and leveraging this representation under a contrastive objective across images and video frames. We show that visual representations learned under ViC-MAE generalize well to both video and image classification tasks. In particular, ViC-MAE obtains state-of-the-art transfer learning performance from video to images on ImageNet-1k compared to the recently proposed OmniMAE, achieving a top-1 accuracy of 86% (+1.3% absolute improvement) when trained on the same data and 87.1% (+2.4% absolute improvement) when trained on extra data. At the same time, ViC-MAE outperforms most other methods on video benchmarks, obtaining 75.9% top-1 accuracy on the challenging Something Something-v2 video benchmark. When trained on videos and images from a diverse combination of datasets, our method maintains a balanced transfer-learning performance between video and image classification benchmarks, coming in a close second to the best supervised method.
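
The training signal described in the abstract can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not specify the pooling operator, the form of the contrastive loss, or how the two terms are weighted, so mean pooling, an InfoNCE-style loss, the `temperature` value, and the `contrastive_weight` parameter are all assumptions, and every function name below is hypothetical.

```python
import math

def mean_pool(patch_feats):
    """Average per-patch feature vectors into one global vector (assumed pooling)."""
    n = len(patch_feats)
    dim = len(patch_feats[0])
    return [sum(p[d] for p in patch_feats) / n for d in range(dim)]

def l2_normalize(v):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def masked_mse(pred, target, mask):
    """Mean squared error over masked patches only (mask[i] truthy = patch i hidden)."""
    total, count = 0.0, 0
    for p, t, m in zip(pred, target, mask):
        if m:
            total += sum((a - b) ** 2 for a, b in zip(p, t)) / len(p)
            count += 1
    return total / max(count, 1)

def info_nce(anchors, positives, temperature=0.1):
    """InfoNCE-style loss (assumed): anchors[i] should match positives[i]
    against every other positive in the batch."""
    a = [l2_normalize(v) for v in anchors]
    b = [l2_normalize(v) for v in positives]
    loss = 0.0
    for i, ai in enumerate(a):
        logits = [sum(x * y for x, y in zip(ai, bj)) / temperature for bj in b]
        top = max(logits)  # subtract the max for numerical stability
        log_denom = top + math.log(sum(math.exp(z - top) for z in logits))
        loss += log_denom - logits[i]
    return loss / len(a)

def combined_loss(preds, targets, masks, feats_a, feats_b, contrastive_weight=1.0):
    """Reconstruction loss plus a weighted contrastive loss on pooled features.

    feats_a[i] and feats_b[i] hold the patch features of two views of sample i,
    for example two frames of one video or two augmentations of one image.
    """
    recon = sum(masked_mse(p, t, m) for p, t, m in zip(preds, targets, masks)) / len(preds)
    glob_a = [mean_pool(f) for f in feats_a]
    glob_b = [mean_pool(f) for f in feats_b]
    return recon + contrastive_weight * info_nce(glob_a, glob_b)
```

In this sketch the reconstruction term shapes the local (per-patch) features, while the contrastive term acts only on their pooled summary, so matching views are pulled together and other samples in the batch serve as negatives.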
