Abstract
Large language models (LLMs) have demonstrated that large-scale pretraining enables systems to adapt rapidly to new problems with little supervision in the language domain. This success, however, has not translated as effectively to the visual domain, where models, including LLMs, continue to struggle with compositional understanding, sample efficiency, and general-purpose problem-solving. We investigate Video Diffusion Models (VDMs) as a promising direction for bridging this gap. Pretraining on spatiotemporal data endows these models with strong inductive biases for structure and dynamics, which we hypothesize can support broad task adaptability. To test this, we design a controlled evaluation in which both a pretrained LLM and a pretrained VDM are equipped with lightweight adapters and presented with tasks in their natural modalities. Across benchmarks including ARC-AGI, ConceptARC, visual games, route planning, and cellular automata, VDMs demonstrate higher data efficiency than their language counterparts. Taken together, our results indicate that video pretraining offers inductive biases that support progress toward visual foundation models.