Abstract
Mask-based autoregressive models have demonstrated promising image generation capability in continuous space. However, their potential for video generation remains under-explored. In this paper, we propose \textbf{VideoMAR}, a concise and efficient decoder-only autoregressive image-to-video model with continuous tokens, combining temporal frame-by-frame generation with spatial masked generation. We first identify temporal causality and spatial bi-directionality as the first principle of video AR models, and propose the next-frame diffusion loss to integrate mask and video generation. In addition, the high cost and difficulty of long-sequence autoregressive modeling is a basic but crucial issue. To this end, we propose temporal short-to-long curriculum learning and spatial progressive resolution training, and employ a progressive temperature strategy at inference time to mitigate error accumulation. Furthermore, VideoMAR carries several distinctive capabilities of language models over to video generation. It is inherently efficient because it combines a temporal-wise KV cache with spatial-wise parallel generation, and it is capable of spatial and temporal extrapolation via 3D rotary embeddings. On the VBench-I2V benchmark, VideoMAR surpasses the previous state-of-the-art (Cosmos I2V) while requiring significantly fewer parameters ($9.3\%$), less training data ($0.5\%$), and fewer GPU resources ($0.2\%$).
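
To make the notion of temporal causality with spatial bi-directionality concrete, one common way to express it is a frame-wise block-causal attention mask: tokens attend freely within their own frame and to all earlier frames, but never to later frames. The following is a minimal sketch under that assumption, not the implementation used in VideoMAR; the function name `frame_causal_mask` and the toy sizes are hypothetical.

```python
import numpy as np

def frame_causal_mask(num_frames: int, tokens_per_frame: int) -> np.ndarray:
    """Return a boolean mask where entry [q, k] is True if query token q may attend to key token k.

    Assumption (illustrative only): tokens are laid out frame by frame,
    with tokens_per_frame consecutive tokens belonging to each frame.
    """
    # Frame index of every token in the flattened sequence, e.g. [0, 0, 1, 1, 2, 2].
    frame_id = np.repeat(np.arange(num_frames), tokens_per_frame)
    # A query may see a key if the key's frame is the same frame or an earlier one.
    return frame_id[:, None] >= frame_id[None, :]

# Toy example: 3 frames with 2 tokens each gives a 6 x 6 mask.
mask = frame_causal_mask(num_frames=3, tokens_per_frame=2)
print(mask.astype(int))
```

Within each diagonal block the mask is fully True (spatial bi-directionality), while blocks above the diagonal are False (temporal causality). This structure is also what allows keys and values of completed frames to be cached across generation steps.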