A Compute-Matched Re-Evaluation of TroVE on MATH

Abstract

Reusing established theorems and formulas is central to mathematical problem solving, serving as essential building blocks for tackling increasingly complex challenges. Recent work, TroVE, argues that code-generating Large Language Models (LLMs) can benefit similarly on the MATH benchmark by inducing and reusing higher-level toolboxes. By allocating computational budget across an ensemble of three modes -- directly generating code, creating tools, and reusing tools -- TroVE claims to outperform a PRIMITIVE baseline that only performs direct generation. However, recent analysis (Berlot-Attwell et al., 2024) casts doubt on these gains, noting that the tools created are often trivial or rarely reused, suggesting that improvements may stem from self-consistency or self-correction. In this work, we re-evaluate TroVE on MATH, analyze the impact of each of its modes, and show that its benefit does not come from these mechanisms, but simply from a higher computational budget spent for TroVE compared to PRIMITIVE. To this end, we also perform a small correction in the original implementation of TroVE's selection mechanism, boosting TroVE's performance on MATH by 3% in accuracy. After matching for compute, the benefit of TroVE reduces to a marginal improvement of 1%, suggesting that this toolbox approach does not provide a significant benefit on MATH.
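To make the idea of a compute-matched comparison concrete, the following is a minimal sketch, not the TroVE implementation. It assumes that each mode can be treated as a sampler returning one candidate answer per call, that the final answer is chosen by a plain majority vote, and that the budget is split evenly across modes. The names `majority_vote`, `solve_with_budget`, and the sampler arguments are hypothetical and introduced only for illustration.

```python
from collections import Counter
from typing import Callable, Optional, Sequence

# Assumption: a sampler maps a problem to one candidate answer,
# or None when the generated program fails to produce an answer.
Sampler = Callable[[str], Optional[str]]


def majority_vote(answers: Sequence[Optional[str]]) -> Optional[str]:
    """Return the most frequent non-None answer.

    Ties are broken in favour of the answer that was seen first.
    """
    valid = [a for a in answers if a is not None]
    if not valid:
        return None
    return Counter(valid).most_common(1)[0][0]


def solve_with_budget(
    problem: str, samplers: Sequence[Sampler], total_samples: int
) -> Optional[str]:
    """Spend exactly total_samples calls, split as evenly as possible
    across the given samplers, then select an answer by majority vote."""
    if not samplers or total_samples < len(samplers):
        raise ValueError("need at least one sample per sampler")
    base, extra = divmod(total_samples, len(samplers))
    answers = []
    for i, sampler in enumerate(samplers):
        # The first `extra` samplers each receive one additional call.
        n_calls = base + (1 if i < extra else 0)
        answers.extend(sampler(problem) for _ in range(n_calls))
    return majority_vote(answers)
```

Under these assumptions, a single-mode baseline would be `solve_with_budget(problem, [direct], K)` and a three-mode ensemble would be `solve_with_budget(problem, [direct, create, reuse], K)` with the same `K`, so that any accuracy difference cannot be attributed to one method simply drawing more samples.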
