Abstract
Recent advances in reinforcement learning (RL) have led to substantial improvements in the mathematical reasoning abilities of large language models (LLMs), as measured by standard benchmarks. However, these gains often persist even when models are trained with flawed signals, such as random or inverted rewards, raising a fundamental question: do such improvements reflect true reasoning, or are they merely artifacts of overfitting to benchmark-specific patterns? To address this question, we take an evaluation-centric perspective and identify two critical shortcomings in existing protocols. First, \emph{benchmark contamination} arises from the public availability of test problems, increasing the risk of data leakage. Second, \emph{evaluation fragility} stems from the reliance on single-instance assessments, which are highly sensitive to stochastic outputs and fail to capture reasoning consistency. To overcome these limitations, we introduce {VAR-MATH}, a symbolic evaluation framework designed to probe genuine reasoning ability. By converting fixed numerical problems into symbolic templates and requiring models to solve multiple instantiations of each, VAR-MATH enforces consistent reasoning across structurally equivalent variants, thereby mitigating contamination and improving evaluation robustness. We apply VAR-MATH to transform two popular benchmarks, AMC23 and AIME24, into their symbolic counterparts, VAR-AMC23 and VAR-AIME24. Experimental results reveal substantial performance drops for RL-trained models on the variabilized versions, especially for smaller models, with average declines of 48.0\% on AMC23 and 58.3\% on AIME24. These findings suggest that many existing RL methods rely on superficial heuristics and fail to generalize beyond specific numerical forms. Overall, VAR-MATH offers a principled, contamination-resistant evaluation paradigm for mathematical reasoning.
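
The idea of turning a fixed numerical problem into a symbolic template and scoring a model over several instantiations can be illustrated with a minimal Python sketch. This is not the VAR-MATH implementation: the names `SymbolicTemplate`, `consistent_score`, `reasoning_solver`, and `memorizing_solver`, the toy problem, the variable range, and the all-or-nothing scoring rule are all assumptions made here for illustration.

```python
import random
import re
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class SymbolicTemplate:
    """Hypothetical container for one variabilized problem."""
    text: str                    # problem statement with {name} placeholders
    ranges: Dict[str, range]     # admissible values for each variable
    answer: Callable[..., int]   # ground truth as a function of the variables

    def instantiate(self, k: int, seed: int = 0) -> List[dict]:
        """Draw k concrete variants of the problem (seeded for repeatability)."""
        rng = random.Random(seed)
        variants = []
        for _ in range(k):
            values = {name: rng.choice(list(r)) for name, r in self.ranges.items()}
            variants.append({
                "question": self.text.format(**values),
                "answer": self.answer(**values),
            })
        return variants


def consistent_score(template: SymbolicTemplate,
                     solver: Callable[[str], int],
                     k: int = 5,
                     seed: int = 0) -> int:
    """Assumed scoring rule: credit (1) only if every variant is solved."""
    variants = template.instantiate(k, seed)
    return int(all(solver(v["question"]) == v["answer"] for v in variants))


# Toy problem: suppose the original benchmark item fixed n = 5 (answer 25).
# The template draws n from a range that excludes the original value.
template = SymbolicTemplate(
    text="What is the sum of the first {n} positive odd integers?",
    ranges={"n": range(6, 21)},
    answer=lambda n: n * n,
)


def reasoning_solver(question: str) -> int:
    """Stand-in for a model that actually uses the value in the question."""
    n = int(re.search(r"\d+", question).group())
    return n * n


def memorizing_solver(question: str) -> int:
    """Stand-in for a model that recalls the original benchmark answer."""
    return 25


print(consistent_score(template, reasoning_solver))   # solves every variant
print(consistent_score(template, memorizing_solver))  # fails on new values
```

Under this assumed rule, a solver that only recalls the answer to the original numerical form receives no credit, while one that adapts to each instantiation does.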