Abstract
Story visualization aims to generate coherent image sequences that faithfully depict a narrative and align with character references. Despite progress in generative models, existing benchmarks are narrow in scope, often limited to short prompts, no character reference, or single-image cases, and fall short of real-world storytelling complexity. This hinders a nuanced understanding of model capabilities and limitations. We present ViStoryBench, a comprehensive benchmark designed to evaluate story visualization models across diverse narrative structures, visual styles, and character settings. The benchmark features richly annotated multi-shot scripts derived from curated stories spanning literature, film, and folklore. Large language models assist in story summarization and script generation, with all outputs verified by humans to ensure coherence and fidelity. Character references are carefully curated to maintain intra-story consistency across varying artistic styles. To enable thorough evaluation, ViStoryBench introduces a set of automated metrics that assess character consistency, style similarity, prompt adherence, aesthetic quality, and generation artifacts such as copy-paste behavior. These metrics are validated through human studies, and used to benchmark a broad range of open-source and commercial models. ViStoryBench offers a high-fidelity, multi-dimensional evaluation suite that facilitates systematic analysis and fosters future progress in visual storytelling.