Evaluating Mathematical Reasoning Beyond Accuracy

Abstract

The leaderboard of Large Language Models (LLMs) in mathematical tasks has been continuously updated. However, the majority of evaluations focus solely on the final results, neglecting the quality of the intermediate steps. This oversight can mask underlying problems, such as logical errors or unnecessary steps in the reasoning process. To measure reasoning beyond final-answer accuracy, we introduce ReasonEval, a new methodology for evaluating the quality of reasoning steps. ReasonEval employs $\textit{validity}$ and $\textit{redundancy}$ to characterize the reasoning quality, as well as accompanying LLMs to assess them automatically. Instantiated by base models that possess strong mathematical knowledge and trained with high-quality labeled data, ReasonEval achieves state-of-the-art performance on human-labeled datasets and can accurately detect different types of errors generated by perturbation. When applied to evaluate LLMs specialized in math, we find that an increase in final-answer accuracy does not necessarily guarantee an improvement in the overall quality of the reasoning steps for challenging mathematical problems. Additionally, we observe that ReasonEval can play a significant role in data selection. We release the best-performing model, meta-evaluation script, and all evaluation results at https://github.com/GAIR-NLP/ReasonEval.
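To make the idea of step-level evaluation concrete, the following is a minimal sketch of how per-step validity and redundancy scores might be combined into solution-level scores. The abstract does not specify an aggregation rule, so everything here is an illustrative assumption: the `StepScores` type, the `aggregate` function, and the choice of taking the minimum validity and the maximum redundancy over steps (so that one invalid or one redundant step is enough to flag the whole solution). The per-step scores are assumed to come from an evaluator model and to lie in [0, 1].

```python
from dataclasses import dataclass
from typing import List


@dataclass
class StepScores:
    """Hypothetical container for one reasoning step's scores, each in [0, 1]."""
    validity: float    # higher means the step is more likely to be logically correct
    redundancy: float  # higher means the step is more likely to be unnecessary


def aggregate(steps: List[StepScores]) -> StepScores:
    """Combine step-level scores into solution-level scores.

    Assumed rule (not taken from the abstract): a solution is only as
    valid as its weakest step, and as redundant as its most redundant step.
    """
    if not steps:
        raise ValueError("a solution must contain at least one step")
    return StepScores(
        validity=min(s.validity for s in steps),
        redundancy=max(s.redundancy for s in steps),
    )


if __name__ == "__main__":
    # Made-up scores for a three-step solution, for illustration only.
    solution = [
        StepScores(validity=0.9, redundancy=0.1),
        StepScores(validity=0.4, redundancy=0.7),
        StepScores(validity=0.8, redundancy=0.2),
    ]
    print(aggregate(solution))
```

Under this assumed rule, a solution can reach the correct final answer and still receive a low validity score or a high redundancy score, which is the kind of gap between final-answer accuracy and reasoning quality that the abstract describes.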
