Cut2Next: Generating Next Shot via In-Context Tuning

Abstract

Effective multi-shot generation demands purposeful, film-like transitions and strict cinematic continuity. Current methods, however, often prioritize basic visual consistency, neglecting crucial editing patterns (e.g., shot/reverse shot, cutaways) that drive narrative flow for compelling storytelling. This yields outputs that may be visually coherent but lack narrative sophistication and true cinematic integrity. To bridge this, we introduce Next Shot Generation (NSG): synthesizing a subsequent, high-quality shot that critically conforms to professional editing patterns while upholding rigorous cinematic continuity. Our framework, Cut2Next, leverages a Diffusion Transformer (DiT). It employs in-context tuning guided by a novel Hierarchical Multi-Prompting strategy. This strategy uses Relational Prompts to define overall context and inter-shot editing styles. Individual Prompts then specify per-shot content and cinematographic attributes. Together, these guide Cut2Next to generate cinematically appropriate next shots. Architectural innovations, Context-Aware Condition Injection (CACI) and Hierarchical Attention Mask (HAM), further integrate these diverse signals without introducing new parameters. We construct RawCuts (large-scale) and CuratedCuts (refined) datasets, both with hierarchical prompts, and introduce CutBench for evaluation. Experiments show Cut2Next excels in visual consistency and text fidelity. Crucially, user studies reveal a strong preference for Cut2Next, particularly for its adherence to intended editing patterns and overall cinematic continuity, validating its ability to generate high-quality, narratively expressive, and cinematically coherent subsequent shots.
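
To make the two prompt levels concrete, the following is a minimal illustrative sketch of how a hierarchical prompt for a two-shot sequence might be organized: one Relational Prompt covering the overall context and editing pattern, plus one Individual Prompt per shot. The class names, field names, and the `flatten` ordering are assumptions made for illustration only; the abstract does not specify the actual prompt format or how Cut2Next encodes it.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class IndividualPrompt:
    """Per-shot description (hypothetical fields, not from the paper)."""
    content: str          # what is visible in the shot
    cinematography: str   # shot size, angle, and similar attributes


@dataclass
class HierarchicalPrompt:
    """One relational prompt plus one individual prompt per shot."""
    relational: str  # overall context and inter-shot editing pattern
    shots: List[IndividualPrompt] = field(default_factory=list)

    def flatten(self) -> List[str]:
        # Assumed ordering: relational prompt first, then each shot in order.
        per_shot = [f"{s.content} ({s.cinematography})" for s in self.shots]
        return [self.relational] + per_shot


example = HierarchicalPrompt(
    relational="Two people talk in a diner; shot/reverse shot between them.",
    shots=[
        IndividualPrompt("A woman speaks across the table", "medium close-up, eye level"),
        IndividualPrompt("A man listens and nods", "medium close-up, eye level"),
    ],
)

for line in example.flatten():
    print(line)
```

Here the first shot would play the role of the conditioning shot and the second the shot to be generated, with the relational prompt naming the editing pattern that links them.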
