Abstract
We introduce InfinityStar, a unified spacetime autoregressive framework for high-resolution image and dynamic video synthesis. Building on the recent success of autoregressive modeling in both vision and language, our purely discrete approach jointly captures spatial and temporal dependencies within a single architecture. This unified design naturally supports a variety of generation tasks such as text-to-image, text-to-video, image-to-video, and long interactive video synthesis via straightforward temporal autoregression. Extensive experiments demonstrate that InfinityStar scores 83.74 on VBench, outperforming all autoregressive models by large margins, even surpassing some diffusion competitors like HunyuanVideo. Without extra optimizations, our model generates a 5s, 720p video approximately 10x faster than leading diffusion-based methods. To our knowledge, InfinityStar is the first discrete autoregressive video generator capable of producing industrial-level 720p videos. We release all code and models to foster further research in efficient, high-quality video generation.