Abstract
Large Language Models (LLMs) have achieved remarkable progress on advanced reasoning tasks such as mathematics and coding competitions. Meanwhile, physics, despite being both reasoning-intensive and essential to real-world understanding, has received limited academic and industrial attention. This paper introduces PHYSICS, a dataset containing 16,568 high-quality physics problems spanning subjects and difficulty levels, to address this gap. Specifically, PHYSICS is curated from exercises in over 100 textbooks through a carefully designed quality-control pipeline. It covers five major physics domains: Mechanics, Electromagnetism, Thermodynamics, Optics, and Modern Physics. It also spans a wide range of difficulty levels, from high school to graduate-level physics courses. To use the data for both improving and evaluating models' physical reasoning capabilities, we split the dataset into training and test sets, and provide reasoning paths generated by powerful reasoning models for the training data. For evaluation, we find that existing evaluation frameworks exhibit biases in aspects such as units, simplification, and precision in the physics domain. To balance efficiency and accuracy, we introduce a Rule+Model evaluation framework tailored to physics problems. Our evaluations of current state-of-the-art open-source and proprietary models highlight their limitations in handling physics-related tasks. We hope that our dataset and evaluation methodology will jointly advance the development of LLMs in the field of physics.
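The abstract does not specify how the Rule+Model framework is built, so the following is only a minimal sketch of one way such a two-stage evaluator could be organized, not the paper's implementation. It assumes a cheap rule stage that normalizes units and compares values within a relative tolerance, and a fallback judge that is consulted only when the rules cannot decide. The unit table `UNITS`, the functions `parse_quantity`, `rule_check`, and `evaluate`, the tolerance value, and the `model_judge` callable are all hypothetical names introduced for illustration.

```python
import math
import re
from typing import Callable, Optional, Tuple

# Hypothetical unit table: symbol -> (factor to SI, dimension label).
# A real evaluator would need a far more complete unit system.
UNITS = {
    "": (1.0, "dimensionless"),
    "m": (1.0, "length"),
    "cm": (1e-2, "length"),
    "km": (1e3, "length"),
    "s": (1.0, "time"),
    "ms": (1e-3, "time"),
}

# Matches a plain decimal or scientific-notation number followed by an optional unit.
_QUANTITY = re.compile(
    r"^\s*([-+]?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)\s*([a-zA-Z]*)\s*$"
)


def parse_quantity(text: str) -> Optional[Tuple[float, str]]:
    """Return (value in SI units, dimension), or None if the rules cannot parse it."""
    match = _QUANTITY.match(text)
    if match is None:
        return None
    unit = match.group(2)
    if unit not in UNITS:
        return None
    factor, dimension = UNITS[unit]
    return float(match.group(1)) * factor, dimension


def rule_check(prediction: str, reference: str, rel_tol: float = 1e-2) -> Optional[bool]:
    """Rule stage: True/False when decidable, None when the answer needs a model."""
    pred, ref = parse_quantity(prediction), parse_quantity(reference)
    if pred is None or ref is None:
        return None  # e.g. symbolic or unsimplified expressions
    if pred[1] != ref[1]:
        return False  # mismatched dimensions cannot be equal
    return math.isclose(pred[0], ref[0], rel_tol=rel_tol)


def evaluate(prediction: str, reference: str,
             model_judge: Callable[[str, str], bool]) -> bool:
    """Use the rule stage first; fall back to the (assumed) model judge if undecided."""
    verdict = rule_check(prediction, reference)
    if verdict is not None:
        return verdict
    return model_judge(prediction, reference)
```

Under these assumptions, `evaluate("150 cm", "1.5 m", judge)` is settled by the rule stage without calling `judge`, while an unsimplified answer such as `"3/2 m"` is passed to the judge. Sending only the undecided cases to a model is one plausible way to trade off efficiency against accuracy.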