Abstract
Generating long, temporally consistent video remains an open challenge in video generation. Primarily due to computational limitations, most prior methods limit themselves to training on a small subset of frames that are then extended to generate longer videos in a sliding-window fashion. Although these techniques may produce sharp videos, they have difficulty retaining long-term temporal consistency due to their limited context length. In this work, we present Temporally Consistent Video Transformer (TECO), a vector-quantized latent dynamics video prediction model that learns compressed representations to efficiently condition on long videos of hundreds of frames during both training and generation. We use a MaskGit prior for dynamics prediction, which enables both sharper and faster generations compared to prior work. Our experiments show that TECO outperforms SOTA baselines in a variety of video prediction benchmarks, ranging from simple mazes in DMLab, to large 3D worlds in Minecraft, to complex real-world videos from Kinetics-600. In addition, to better understand the capabilities of video prediction models in modeling temporal consistency, we introduce several challenging video prediction tasks consisting of agents randomly traversing 3D scenes of varying difficulty. This presents a challenging benchmark for video prediction in partially observable environments where a model must understand what parts of the scenes to re-create versus invent depending on its past observations or generations. Generated videos are available at https://wilson1yan.github.io/teco