Abstract
We empirically study autoregressive pre-training from videos. To perform our study, we construct a series of autoregressive video models, called Toto. We treat videos as sequences of visual tokens and train transformer models to autoregressively predict future tokens. Our models are pre-trained on a diverse dataset of videos and images comprising over 1 trillion visual tokens. We explore different architectural, training, and inference design choices. We evaluate the learned visual representations on a range of downstream tasks including image recognition, video classification, object tracking, and robotics. Our results demonstrate that, despite minimal inductive biases, autoregressive pre-training leads to competitive performance across all benchmarks. Finally, we find that scaling our video models results in similar scaling curves to those seen in language models, albeit with a different rate. More details at https://brjathu.github.io/toto/
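
As an illustration of the next-token objective described above, the following is a minimal sketch, not the authors' implementation. It assumes each frame has already been mapped to a short list of discrete token ids by some tokenizer (the abstract does not specify one), and the function names `flatten_video`, `shift_for_next_token`, and `cross_entropy` are hypothetical. The logits would come from a transformer in practice; here they are plain lists so the example stays self-contained.

```python
import math
from typing import List, Sequence, Tuple


def flatten_video(frames: Sequence[Sequence[int]]) -> List[int]:
    """Concatenate per-frame token ids into one sequence, in temporal order."""
    return [tok for frame in frames for tok in frame]


def shift_for_next_token(tokens: Sequence[int]) -> Tuple[List[int], List[int]]:
    """Build (inputs, targets) so that position i is trained to predict token i + 1."""
    if len(tokens) < 2:
        raise ValueError("need at least two tokens to form a prediction target")
    return list(tokens[:-1]), list(tokens[1:])


def cross_entropy(logits: Sequence[Sequence[float]], targets: Sequence[int]) -> float:
    """Mean negative log-likelihood of the targets under softmax(logits)."""
    total = 0.0
    for row, target in zip(logits, targets):
        # Log-sum-exp with the max subtracted for numerical stability.
        m = max(row)
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[target]
    return total / len(targets)


# Toy example: two frames of two tokens each, vocabulary size 4 (assumed values).
frames = [[3, 1], [2, 0]]
inputs, targets = shift_for_next_token(flatten_video(frames))

# Stand-in for model outputs: uniform logits over the vocabulary at every position.
uniform_logits = [[0.0] * 4 for _ in targets]
loss = cross_entropy(uniform_logits, targets)
```

With uniform logits the loss equals the log of the vocabulary size, which is a convenient sanity check for an untrained model before any pre-training has taken place.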