Abstract
Recent advances in video generation have been driven by diffusion models and autoregressive frameworks, yet critical challenges persist in harmonizing prompt adherence, visual quality, motion dynamics, and duration: motion dynamics are compromised to enhance temporal visual quality, video duration is constrained (5-10 seconds) to prioritize resolution, and shot-aware generation is inadequate because general-purpose MLLMs cannot interpret cinematic grammar, such as shot composition, actor expressions, and camera motions. These intertwined limitations hinder realistic long-form synthesis and professional film-style generation. To address them, we propose SkyReels-V2, an Infinite-length Film Generative Model that synergizes a Multi-modal Large Language Model (MLLM), Multi-stage Pretraining, Reinforcement Learning, and a Diffusion Forcing Framework. First, we design a comprehensive structural representation of video that combines general descriptions from the multi-modal LLM with detailed shot language from sub-expert models. Aided by human annotation, we then train a unified video captioner, named SkyCaptioner-V1, to efficiently label the video data. Second, we establish progressive-resolution pretraining for fundamental video generation, followed by a four-stage post-training enhancement: initial concept-balanced Supervised Fine-Tuning (SFT) improves baseline quality; motion-specific Reinforcement Learning (RL) training with human-annotated and synthetic distortion data addresses dynamic artifacts; our diffusion forcing framework with non-decreasing noise schedules enables long-video synthesis in an efficient search space; and a final high-quality SFT refines visual fidelity. All the code and models are available at https://github.com/SkyworkAI/SkyReels-V2.