Abstract
Recent advancements in Large Language Models (LLMs) have sparked interest in their formal reasoning capabilities, particularly in mathematics. The GSM8K benchmark is widely used to assess the mathematical reasoning of models on grade-school-level questions. While the performance of LLMs on GSM8K has significantly improved in recent years, it remains unclear whether their mathematical reasoning capabilities have genuinely advanced, raising questions about the reliability of the reported metrics. To address these concerns, we conduct a large-scale study on several SOTA open and closed models. To overcome the limitations of existing evaluations, we introduce GSM-Symbolic, an improved benchmark created from symbolic templates that allow for the generation of a diverse set of questions. GSM-Symbolic enables more controllable evaluations, providing key insights and more reliable metrics for measuring the reasoning capabilities of models. Our findings reveal that LLMs exhibit noticeable variance when responding to different instantiations of the same question. Specifically, the performance of all models declines when only the numerical values in the question are altered in the GSM-Symbolic benchmark. Furthermore, we investigate the fragility of mathematical reasoning in these models and show that their performance significantly deteriorates as the number of clauses in a question increases. We hypothesize that this decline is because current LLMs cannot perform genuine logical reasoning; they replicate reasoning steps from their training data. Adding a single clause that seems relevant to the question causes significant performance drops (up to 65%) across all state-of-the-art models, even though the clause doesn't contribute to the reasoning chain needed for the final answer. Overall, our work offers a more nuanced understanding of LLMs' capabilities and limitations in mathematical reasoning.
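
To make the idea of a symbolic template concrete, the following minimal Python sketch shows one way such a template could generate many instantiations of the same question, optionally with a seemingly relevant clause that does not affect the answer. It is an illustration under stated assumptions, not the benchmark's actual implementation: the class `SymbolicTemplate`, its fields, the example problem, the names, and the value ranges are all hypothetical.

```python
import random
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class SymbolicTemplate:
    """Hypothetical template: fixed wording with named placeholders."""
    body: str                              # statement of the problem
    question: str                          # final question sentence
    distractor: str                        # clause irrelevant to the answer
    names: List[str]                       # candidate proper names
    ranges: Dict[str, Tuple[int, int]]     # inclusive range per variable
    answer: Callable[[Dict[str, int]], int]

    def instantiate(self, rng: random.Random, with_distractor: bool = False):
        # Draw every variable in a fixed order so that a given seed yields
        # the same values with or without the distractor clause.
        values = {var: rng.randint(lo, hi) for var, (lo, hi) in self.ranges.items()}
        name = rng.choice(self.names)
        parts = [self.body]
        if with_distractor:
            parts.append(self.distractor)
        parts.append(self.question)
        text = " ".join(p.format(name=name, **values) for p in parts)
        return text, self.answer(values)


# Illustrative problem (assumed, not taken from GSM8K or GSM-Symbolic).
TEMPLATE = SymbolicTemplate(
    body="{name} picks {x} apples on Monday and {y} apples on Tuesday.",
    question="How many apples does {name} have?",
    distractor="{z} of the apples are a bit smaller than average.",
    names=["Ava", "Liam", "Sofia"],
    ranges={"x": (10, 50), "y": (10, 50), "z": (2, 9)},
    answer=lambda v: v["x"] + v["y"],
)

if __name__ == "__main__":
    for seed in range(3):
        plain, gold = TEMPLATE.instantiate(random.Random(seed))
        noisy, _ = TEMPLATE.instantiate(random.Random(seed), with_distractor=True)
        print(plain, "->", gold)
        print(noisy, "->", gold)
```

Evaluating a model on many such instantiations, rather than on a single fixed question, is what allows accuracy to be reported as a distribution. Comparing the plain and distractor variants generated from the same seed isolates the effect of the added clause.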