Abstract
Transformers today still struggle to generate one-minute videos because self-attention layers are inefficient for long context. Alternatives such as Mamba layers struggle with complex multi-scene stories because their hidden states are less expressive. We experiment with Test-Time Training (TTT) layers, whose hidden states themselves can be neural networks, and are therefore more expressive. Adding TTT layers into a pre-trained Transformer enables it to generate one-minute videos from text storyboards. For proof of concept, we curate a dataset based on Tom and Jerry cartoons. Compared to baselines such as Mamba 2, Gated DeltaNet, and sliding-window attention layers, TTT layers generate much more coherent videos that tell complex stories, leading by 34 Elo points in a human evaluation of 100 videos per method. Although promising, results still contain artifacts, likely due to the limited capability of the pre-trained 5B model. The efficiency of our implementation can also be improved. We have only experimented with one-minute videos due to resource constraints, but the approach can be extended to longer videos and more complex stories. Sample videos, code and annotations are available at: https://test-time-training.github.io/video-dit