ToolRLA: Multiplicative Reward Decomposition for Tool-Integrated Agents

Abstract

Tool-integrated agents that interleave reasoning with API calls are promising for complex tasks, yet aligning them for high-stakes, domain-specific deployment remains challenging: existing reinforcement learning approaches rely on coarse binary rewards that cannot distinguish tool selection errors from malformed parameters. We present ToolRLA, a three-stage post-training pipeline (SFT $\rightarrow$ GRPO $\rightarrow$ DPO) for domain-specific tool agents. The core contribution is a fine-grained reward function with multiplicative correctness decomposition spanning four dimensions -- format validity, tool selection, parameter accuracy, and regulatory compliance -- that encodes domain priority orderings as inductive biases in the reward landscape. Deployed on a financial advisory copilot (80+ advisors, 1,200+ daily queries), ToolRLA achieves over three months: a 47\% improvement in task completion rate ($62\%\rightarrow91\%$), a 63\% reduction in tool invocation errors ($38\%\rightarrow14\%$), and a 93\% reduction in regulatory violations ($12\%\rightarrow0.8\%$), within sub-2-second latency. Ablation studies show the multiplicative reward design accounts for 7 percentage points of improvement over additive alternatives. Generalization is further validated on ToolBench and API-Bank.

Quick Read (beta)

loading the full paper ...