Self-Forcing++: Towards Minute-Scale High-Quality Video Generation

Abstract

Diffusion models have revolutionized image and video generation, achieving unprecedented visual quality. However, their reliance on transformer architectures incurs prohibitively high computational costs, particularly when extending generation to long videos. Recent work has explored autoregressive formulations for long video generation, typically by distilling from short-horizon bidirectional teachers. Nevertheless, given that teacher models cannot synthesize long videos, the extrapolation of student models beyond their training horizon often leads to pronounced quality degradation, arising from the compounding of errors within the continuous latent space. In this paper, we propose a simple yet effective approach to mitigate quality degradation in long-horizon video generation without requiring supervision from long-video teachers or retraining on long video datasets. Our approach centers on exploiting the rich knowledge of teacher models to provide guidance for the student model through sampled segments drawn from self-generated long videos. Our method maintains temporal consistency while scaling video length by up to 20x beyond the teacher's capability, avoiding common issues such as over-exposure and error accumulation, without recomputing overlapping frames as previous methods do. When scaling up the computation, our method shows the capability of generating videos up to 4 minutes and 15 seconds, equivalent to 99.9% of the maximum span supported by our base model's position embedding and more than 50x longer than that of our baseline model. Experiments on standard benchmarks and our proposed improved benchmark demonstrate that our approach substantially outperforms baseline methods in both fidelity and consistency. Our long-horizon video demos can be found at https://self-forcing-plus-plus.github.io/
