Abstract
The rapid advancement of Text-to-Video (T2V) generation technologies, which are critical to building "world models", has made existing benchmarks increasingly insufficient for evaluating state-of-the-art T2V models. First, current evaluation dimensions, such as per-frame aesthetic quality and temporal consistency, can no longer differentiate state-of-the-art T2V models. Second, event-level temporal causality, which not only distinguishes video from other modalities but also constitutes a crucial component of world models, is severely underexplored in existing benchmarks. Third, existing benchmarks lack a systematic assessment of world knowledge, which is an essential capability for building world models. To address these issues, we introduce VideoVerse, a comprehensive benchmark that evaluates whether a T2V model can understand complex temporal causality and world knowledge in the real world. We collect representative videos across diverse domains (e.g., natural landscapes, sports, indoor scenes, science fiction, chemical and physical experiments) and extract their event-level descriptions with inherent temporal causality, which are then rewritten into text-to-video prompts by independent annotators. For each prompt, we design a suite of binary evaluation questions from the perspectives of dynamic and static properties, with a total of ten carefully defined evaluation dimensions. In total, VideoVerse comprises 300 carefully curated prompts, involving 815 events and 793 binary evaluation questions. Building on these questions, we develop a human-preference-aligned, QA-based evaluation pipeline using modern vision-language models. Finally, we perform a systematic evaluation of state-of-the-art open-source and closed-source T2V models on VideoVerse, providing an in-depth analysis of how far current T2V generators are from world models.