Abstract
State-of-the-art text-to-video models excel at generating isolated clips but fall short of creating the coherent, multi-shot narratives that are the essence of storytelling. We bridge this "narrative gap" with HoloCine, a model that generates entire scenes holistically to ensure global consistency from the first shot to the last. Our architecture achieves precise directorial control through a Window Cross-Attention mechanism that localizes text prompts to specific shots, while a Sparse Inter-Shot Self-Attention pattern (dense within shots but sparse between them) ensures the efficiency required for minute-scale generation. Beyond setting a new state of the art in narrative coherence, HoloCine develops remarkable emergent abilities: a persistent memory for characters and scenes, and an intuitive grasp of cinematic techniques. Our work marks a pivotal shift from clip synthesis towards automated filmmaking, making end-to-end cinematic creation a tangible future. Our code is available at: https://holo-cine.github.io/.
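To make the two attention patterns named above concrete, the following is a minimal illustrative sketch of the masks they might induce, not the HoloCine implementation. The abstract specifies only that self-attention is dense within shots and sparse between them, and that text prompts are localized to specific shots. Everything else here is an assumption made for illustration: the function names, the `summary_tokens` parameter, and in particular the choice to let the first few tokens of each shot serve as the only keys visible from other shots.

```python
from itertools import accumulate


def shot_ids(segment_lengths):
    """Map each token position to the index of the segment (shot) it belongs to."""
    return [s for s, n in enumerate(segment_lengths) for _ in range(n)]


def sparse_self_attention_mask(shot_lengths, summary_tokens=1):
    """Return mask[i][j], True when video token i may attend to video token j.

    Assumption (not stated in the abstract): the first `summary_tokens`
    tokens of every shot are the only keys visible from other shots.
    """
    ids = shot_ids(shot_lengths)
    # Start index of each shot in the flattened token sequence.
    starts = [0] + list(accumulate(shot_lengths))[:-1]
    # Offset of each token inside its own shot.
    offsets = [i - starts[s] for i, s in enumerate(ids)]
    n = len(ids)
    return [
        # Dense within a shot; across shots, only summary tokens are visible.
        [ids[i] == ids[j] or offsets[j] < summary_tokens for j in range(n)]
        for i in range(n)
    ]


def window_cross_attention_mask(shot_lengths, prompt_lengths):
    """Return mask[i][k], True when video token i may attend to text token k.

    Assumption: one prompt per shot, and each shot sees only its own prompt.
    """
    video_ids = shot_ids(shot_lengths)
    text_ids = shot_ids(prompt_lengths)
    return [[v == t for t in text_ids] for v in video_ids]


if __name__ == "__main__":
    # Two shots of 3 and 2 video tokens, with prompts of 2 and 1 text tokens.
    self_mask = sparse_self_attention_mask([3, 2], summary_tokens=1)
    cross_mask = window_cross_attention_mask([3, 2], [2, 1])
    for row in self_mask:
        print("".join("x" if allowed else "." for allowed in row))
    allowed_pairs = sum(sum(row) for row in self_mask)
    print(f"{allowed_pairs} of {len(self_mask) ** 2} query-key pairs kept")
```

Under these assumptions, the number of allowed query-key pairs grows with the sum of squared shot lengths plus a term proportional to the number of summary tokens, rather than with the square of the full sequence length, which is the kind of saving that makes minute-scale generation plausible. The boolean convention used here (True means "may attend") matches the one PyTorch's `scaled_dot_product_attention` uses for boolean `attn_mask` arguments.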