Abstract
Scaling has not yet been convincingly demonstrated for pure self-supervised learning from video. However, prior work has focused evaluations on semantic-related tasks (action classification, ImageNet classification, etc.). In this paper we focus on evaluating self-supervised learning on non-semantic vision tasks that are more spatial (3D) and temporal (+1D = 4D), such as camera pose estimation, point and object tracking, and depth estimation. We show that by learning from very large video datasets, masked auto-encoding (MAE) with transformer video models actually scales, consistently improving performance on these 4D tasks as model size increases from 20M parameters all the way to 22B parameters, by far the largest self-supervised video model reported. A rigorous apples-to-apples comparison with many recent image and video models demonstrates the benefits of scaling 4D representations.