An Empirical Study of Autoregressive Pre-training from Videos

  • 2025-01-09 18:59:58
  • Jathushan Rajasegaran, Ilija Radosavovic, Rahul Ravishankar, Yossi Gandelsman, Christoph Feichtenhofer, Jitendra Malik

Abstract

We empirically study autoregressive pre-training from videos. To perform our study, we construct a series of autoregressive video models, called Toto. We treat videos as sequences of visual tokens and train transformer models to autoregressively predict future tokens. Our models are pre-trained on a diverse dataset of videos and images comprising over 1 trillion visual tokens. We explore different architectural, training, and inference design choices. We evaluate the learned visual representations on a range of downstream tasks including image recognition, video classification, object tracking, and robotics. Our results demonstrate that, despite minimal inductive biases, autoregressive pre-training leads to competitive performance across all benchmarks. Finally, we find that scaling our video models results in similar scaling curves to those seen in language models, albeit with a different rate. More details at https://brjathu.github.io/toto/
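
To make the training objective concrete, the following is a minimal sketch of next-token prediction over a video treated as one flat sequence of visual tokens. It is not the authors' implementation: the function names `flatten_video` and `next_token_loss`, the raster ordering of tokens within and across frames, and the vocabulary size used in the usage lines are all assumptions made for illustration, and a real model would produce the logits with a causal transformer rather than take them as input.

```python
import numpy as np


def flatten_video(frame_tokens):
    """Flatten per-frame token grids of shape (F, H, W) into one 1-D sequence.

    Raster order (frame by frame, row by row) is an assumption of this sketch.
    """
    return np.asarray(frame_tokens).reshape(-1)


def next_token_loss(logits, tokens):
    """Mean cross-entropy for predicting each token from the ones before it.

    logits: (T, V) float array; row t scores the token at position t + 1.
    tokens: (T,) int array of visual token ids.
    """
    # Positions 0..T-2 are scored against tokens 1..T-1 (shift by one).
    pred = logits[:-1]
    target = tokens[1:]
    # Numerically stable log-softmax over the vocabulary axis.
    shifted = pred - pred.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # Pick the log-probability assigned to each true next token.
    return -log_probs[np.arange(target.shape[0]), target].mean()


# Usage with hypothetical sizes: 2 frames of 2x2 tokens, vocabulary of 16.
tokens = flatten_video(np.arange(8).reshape(2, 2, 2) % 16)
logits = np.zeros((tokens.shape[0], 16))  # stand-in for model outputs
loss = next_token_loss(logits, tokens)
```

With all-zero logits the predicted distribution is uniform, so the loss equals the natural log of the vocabulary size, which is a convenient sanity check before plugging in a real model.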

 
