PhyX: Does Your Model Have the "Wits" for Physical Reasoning?

  • 2025-05-29 18:59:14
  • Hui Shen, Taiqiang Wu, Qi Han, Yunta Hsieh, Jizhou Wang, Yuyue Zhang, Yuxin Cheng, Zijian Hao, Yuansheng Ni, Xin Wang, Zhongwei Wan, Kai Zhang, Wendong Xu, Jing Xiong, Ping Luo, Wenhu Chen, Chaofan Tao, Zhuoqing Mao, Ngai Wong

Abstract

Existing benchmarks fail to capture a crucial aspect of intelligence: physical reasoning, the integrated ability to combine domain knowledge, symbolic reasoning, and understanding of real-world constraints. To address this gap, we introduce PhyX: the first large-scale benchmark designed to assess models' capacity for physics-grounded reasoning in visual scenarios. PhyX includes 3K meticulously curated multimodal questions spanning 6 reasoning types across 25 sub-domains and 6 core physics domains: thermodynamics, electromagnetism, mechanics, modern physics, optics, and wave & acoustics. In our comprehensive evaluation, even state-of-the-art models struggle significantly with physical reasoning. GPT-4o, Claude3.7-Sonnet, and GPT-o4-mini achieve only 32.5%, 42.2%, and 45.8% accuracy, respectively, leaving performance gaps exceeding 29% compared to human experts. Our analysis exposes critical limitations in current models: over-reliance on memorized disciplinary knowledge, excessive dependence on mathematical formulations, and surface-level visual pattern matching rather than genuine physical understanding. We provide in-depth analysis through fine-grained statistics, detailed case studies, and multiple evaluation paradigms to thoroughly examine physical reasoning capabilities. To ensure reproducibility, we implement a compatible evaluation protocol based on widely used toolkits such as VLMEvalKit, enabling one-click evaluation. More details are available on our project page: https://phyx-bench.github.io/.

 
