WorldSimBench: Towards Video Generation Models as World Simulators

Abstract

Recent advancements in predictive models have demonstrated exceptional capabilities in predicting the future state of objects and scenes. However, the lack of categorization based on inherent characteristics continues to hinder the progress of predictive model development. Additionally, existing benchmarks are unable to effectively evaluate higher-capability, highly embodied predictive models from an embodied perspective. In this work, we classify the functionalities of predictive models into a hierarchy and take the first step in evaluating World Simulators by proposing a dual evaluation framework called WorldSimBench. WorldSimBench includes Explicit Perceptual Evaluation and Implicit Manipulative Evaluation, encompassing human preference assessments from the visual perspective and action-level evaluations in embodied tasks, covering three representative embodied scenarios: Open-Ended Embodied Environment, Autonomous Driving, and Robot Manipulation. In the Explicit Perceptual Evaluation, we introduce the HF-Embodied Dataset, a video assessment dataset based on fine-grained human feedback, which we use to train a Human Preference Evaluator that aligns with human perception and explicitly assesses the visual fidelity of World Simulators. In the Implicit Manipulative Evaluation, we assess the video-action consistency of World Simulators by evaluating whether the generated situation-aware video can be accurately translated into the correct control signals in dynamic environments. Our comprehensive evaluation offers key insights that can drive further innovation in video generation models, positioning World Simulators as a pivotal advancement toward embodied artificial intelligence.
