VideoGen-of-Thought: Step-by-step generating multi-shot video with minimal manual intervention

Abstract

Current video generation models excel at short clips but fail to produce cohesive multi-shot narratives due to disjointed visual dynamics and fractured storylines. Existing solutions either rely on extensive manual scripting/editing or prioritize single-shot fidelity over cross-scene continuity, limiting their practicality for movie-like content. We introduce VideoGen-of-Thought (VGoT), a step-by-step framework that automates multi-shot video synthesis from a single sentence by systematically addressing three core challenges: (1) Narrative Fragmentation: Existing methods lack structured storytelling. We propose dynamic storyline modeling, which first converts the user prompt into concise shot descriptions, then elaborates them into detailed, cinematic specifications across five domains (character dynamics, background continuity, relationship evolution, camera movements, HDR lighting), ensuring logical narrative progression with self-validation. (2) Visual Inconsistency: Existing approaches struggle with maintaining visual consistency across shots. Our identity-aware cross-shot propagation generates identity-preserving portrait (IPP) tokens that maintain character fidelity while allowing trait variations (expressions, aging) dictated by the storyline. (3) Transition Artifacts: Abrupt shot changes disrupt immersion. Our adjacent latent transition mechanisms implement boundary-aware reset strategies that process adjacent shots' features at transition points, enabling seamless visual flow while preserving narrative continuity. VGoT generates multi-shot videos that outperform state-of-the-art baselines by 20.4% in within-shot face consistency and 17.4% in style consistency, while achieving over 100% better cross-shot consistency and 10x fewer manual adjustments than alternatives.
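To make the five-domain shot specification concrete, the following is a minimal sketch of how one shot's description could be represented and checked for completeness. It is an illustration only, not the VGoT implementation: the names `ShotSpec`, `missing_domains`, and `validate_storyline` are hypothetical, the example shot text is made up, and the completeness check is an assumed stand-in for the self-validation step, whose actual criteria the abstract does not specify.

```python
from dataclasses import dataclass, fields
from typing import Dict, List


@dataclass
class ShotSpec:
    """One shot's detailed description (hypothetical structure).

    The five fields mirror the five domains named in the abstract.
    """
    character_dynamics: str
    background_continuity: str
    relationship_evolution: str
    camera_movements: str
    hdr_lighting: str


def missing_domains(spec: ShotSpec) -> List[str]:
    """Return the names of domains whose text is empty or whitespace only."""
    return [f.name for f in fields(spec) if not getattr(spec, f.name).strip()]


def validate_storyline(shots: List[ShotSpec]) -> Dict[int, List[str]]:
    """Map shot index to its missing domains, for shots that fail the check.

    Assumption: "valid" here only means every domain is filled in. The
    paper's self-validation may check much more (e.g. narrative logic).
    """
    report: Dict[int, List[str]] = {}
    for index, spec in enumerate(shots):
        missing = missing_domains(spec)
        if missing:
            report[index] = missing
    return report


if __name__ == "__main__":
    # Made-up example shots, for illustration only.
    shots = [
        ShotSpec("A girl laces her skates", "frozen lake at dawn",
                 "alone, determined", "slow dolly-in", "soft low sun"),
        ShotSpec("She glides across the ice", "frozen lake at dawn",
                 "", "tracking shot", "soft low sun"),
    ]
    print(validate_storyline(shots))  # {1: ['relationship_evolution']}
```

In a pipeline of this shape, a non-empty report would presumably trigger regeneration of the flagged shot descriptions before any video is synthesized; that control flow is likewise an assumption.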
