SkyReels-V2: Infinite-length Film Generative Model

Abstract

Recent advances in video generation have been driven by diffusion models and autoregressive frameworks, yet critical challenges persist in harmonizing prompt adherence, visual quality, motion dynamics, and duration: compromises in motion dynamics to enhance temporal visual quality, constrained video duration (5-10 seconds) to prioritize resolution, and inadequate shot-aware generation stemming from general-purpose MLLMs' inability to interpret cinematic grammar, such as shot composition, actor expressions, and camera motions. These intertwined limitations hinder realistic long-form synthesis and professional film-style generation. To address these limitations, we propose SkyReels-V2, an Infinite-length Film Generative Model that synergizes a Multi-modal Large Language Model (MLLM), Multi-stage Pretraining, Reinforcement Learning, and a Diffusion Forcing Framework. Firstly, we design a comprehensive structural representation of video that combines the general descriptions produced by the Multi-modal LLM with the detailed shot language produced by sub-expert models. Aided by human annotation, we then train a unified Video Captioner, named SkyCaptioner-V1, to efficiently label the video data. Secondly, we establish progressive-resolution pretraining for fundamental video generation, followed by a four-stage post-training enhancement: initial concept-balanced Supervised Fine-Tuning (SFT) improves baseline quality; motion-specific Reinforcement Learning (RL) training with human-annotated and synthetic distortion data addresses dynamic artifacts; our diffusion forcing framework with non-decreasing noise schedules enables long-video synthesis in an efficient search space; and a final high-quality SFT stage refines visual fidelity. All the code and models are available at https://github.com/SkyworkAI/SkyReels-V2.
