Can Your Model Separate Yolks with a Water Bottle? Benchmarking Physical Commonsense Understanding in Video Generation Models

Abstract

Recent progress in text-to-video (T2V) generation has enabled the synthesis of visually compelling and temporally coherent videos from natural language. However, these models often fall short in basic physical commonsense, producing outputs that violate intuitive expectations around causality, object behavior, and tool use. Addressing this gap, we present PhysVidBench, a benchmark designed to evaluate the physical reasoning capabilities of T2V systems. The benchmark includes 383 carefully curated prompts emphasizing tool use, material properties, and procedural interactions, domains where physical plausibility is crucial. For each prompt, we generate videos using diverse state-of-the-art models and adopt a three-stage evaluation pipeline: (1) formulate grounded physics questions from the prompt, (2) caption the generated video with a vision-language model, and (3) task a language model to answer several physics-involved questions using only the caption. This indirect strategy circumvents common hallucination issues in direct video-based evaluation. By highlighting affordances and tool-mediated actions, areas overlooked in current T2V evaluations, PhysVidBench provides a structured, interpretable framework for assessing physical commonsense in generative video models.
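
The three-stage pipeline described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function name `evaluate_video`, the three callables, and the scoring rule (fraction of questions answered "yes") are all assumptions made for the example, and the stub models are toy stand-ins for a real question generator, vision-language captioner, and language-model judge.

```python
from typing import Callable, List


def evaluate_video(
    prompt: str,
    video: object,
    make_questions: Callable[[str], List[str]],
    caption_video: Callable[[object], str],
    answer_from_caption: Callable[[str, str], bool],
) -> float:
    """Score one generated video (hypothetical scoring rule).

    Stage 1: derive physics questions from the prompt.
    Stage 2: caption the video with a vision-language model.
    Stage 3: answer each question using only the caption.
    Returns the fraction of questions answered "yes" (an assumption).
    """
    questions = make_questions(prompt)            # stage 1
    caption = caption_video(video)                # stage 2
    answers = [answer_from_caption(caption, q) for q in questions]  # stage 3
    return sum(answers) / len(answers) if answers else 0.0


# Toy stand-ins for the three models (hypothetical, for illustration only).
def stub_questions(prompt: str) -> List[str]:
    return [
        "Does the yolk move into the bottle?",
        "Does the egg white stay on the plate?",
    ]


def stub_caption(video: object) -> str:
    return "A squeezed bottle is released over the yolk; the yolk moves into the bottle."


def stub_answer(caption: str, question: str) -> bool:
    # Crude keyword check standing in for a language-model judgment.
    return "yolk" in question.lower() and "yolk moves into the bottle" in caption


score = evaluate_video(
    "Separate an egg yolk using a water bottle",
    video=None,  # placeholder; a real pipeline would pass video frames or a path
    make_questions=stub_questions,
    caption_video=stub_caption,
    answer_from_caption=stub_answer,
)
```

Note that the judge in stage 3 never sees the video itself, only the caption. That separation is what the abstract calls the indirect strategy.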
