Taming Teacher Forcing for Masked Autoregressive Video Generation

Abstract

We introduce MAGI, a hybrid video generation framework that combines masked modeling for intra-frame generation with causal modeling for next-frame generation. Our key innovation, Complete Teacher Forcing (CTF), conditions masked frames on complete observation frames rather than masked ones (namely Masked Teacher Forcing, MTF), enabling a smooth transition from token-level (patch-level) to frame-level autoregressive generation. CTF significantly outperforms MTF, achieving a +23% improvement in FVD scores on first-frame conditioned video prediction. To address issues like exposure bias, we employ targeted training strategies, setting a new benchmark in autoregressive video generation. Experiments show that MAGI can generate long, coherent video sequences exceeding 100 frames, even when trained on as few as 16 frames, highlighting its potential for scalable, high-quality video generation.
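
The difference between CTF and MTF described above can be illustrated with a minimal sketch. This is not the paper's implementation: the function name `conditioning_context`, the `MASK_ID` placeholder, and the toy token grid are all assumptions made for illustration. It only shows which version of the earlier frames a masked frame would be conditioned on under each scheme.

```python
import numpy as np

MASK_ID = -1  # hypothetical placeholder id standing in for a masked token


def conditioning_context(frames, mask, t, mode):
    """Return the tokens of frames 0..t-1 that frame t is conditioned on.

    frames: (T, N) integer array of token ids, N tokens per frame.
    mask:   (T, N) boolean array, True where a token is masked in training.
    mode:   "CTF" uses the complete (unmasked) earlier frames;
            "MTF" uses the masked versions of the earlier frames.
    """
    past = frames[:t]
    if mode == "CTF":
        # Complete Teacher Forcing: earlier frames are fully observed.
        return past.copy()
    if mode == "MTF":
        # Masked Teacher Forcing: earlier frames keep their training masks.
        return np.where(mask[:t], MASK_ID, past)
    raise ValueError(f"unknown mode: {mode!r}")


# Toy example: 3 frames of 4 tokens each, with an arbitrary mask pattern.
frames = np.arange(12).reshape(3, 4)
mask = np.array([
    [False, True, False, True],
    [True, False, False, False],
    [False, False, True, True],
])

print(conditioning_context(frames, mask, t=2, mode="CTF"))
print(conditioning_context(frames, mask, t=2, mode="MTF"))
```

Under this sketch, the CTF context for frame 2 contains every token of frames 0 and 1, while the MTF context has `MASK_ID` wherever those frames were masked.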
