Abstract
Existing benchmarks fail to capture a crucial aspect of intelligence: physical reasoning, the integrated ability to combine domain knowledge, symbolic reasoning, and an understanding of real-world constraints. To address this gap, we introduce PhyX, the first large-scale benchmark designed to assess models' capacity for physics-grounded reasoning in visual scenarios. PhyX includes 3K meticulously curated multimodal questions spanning 6 reasoning types across 25 sub-domains and 6 core physics domains: thermodynamics, electromagnetism, mechanics, modern physics, optics, and waves and acoustics. In our comprehensive evaluation, even state-of-the-art models struggle significantly with physical reasoning. GPT-4o, Claude3.7-Sonnet, and GPT-o4-mini achieve only 32.5%, 42.2%, and 45.8% accuracy, respectively, leaving performance gaps exceeding 29% relative to human experts. Our analysis exposes critical limitations in current models: over-reliance on memorized disciplinary knowledge, excessive dependence on mathematical formulations, and surface-level visual pattern matching rather than genuine physical understanding. We provide in-depth analysis through fine-grained statistics, detailed case studies, and multiple evaluation paradigms to thoroughly examine physical reasoning capabilities. To ensure reproducibility, we implement a compatible evaluation protocol based on widely used toolkits such as VLMEvalKit, enabling one-click evaluation. More details are available on our project page: https://phyx-bench.github.io/.