AgenTracer: Who Is Inducing Failure in the LLM Agentic Systems?

Abstract

Large Language Model (LLM)-based agentic systems, often comprising multiple models, complex tool invocations, and orchestration protocols, substantially outperform monolithic agents. Yet this very sophistication amplifies their fragility, making them more prone to system failure. Pinpointing the specific agent or step responsible for an error within long execution traces defines the task of agentic system failure attribution. Current state-of-the-art reasoning LLMs, however, remain strikingly inadequate for this challenge, with accuracy generally below 10%. To address this gap, we propose AgenTracer, the first automated framework for annotating failed multi-agent trajectories via counterfactual replay and programmed fault injection, producing the curated dataset TracerTraj. Leveraging this resource, we develop AgenTracer-8B, a lightweight failure tracer trained with multi-granular reinforcement learning, capable of efficiently diagnosing errors in verbose multi-agent interactions. On the Who&When benchmark, AgenTracer-8B outperforms giant proprietary LLMs like Gemini-2.5-Pro and Claude-4-Sonnet by up to 18.18%, setting a new standard in LLM agentic failure attribution. More importantly, AgenTracer-8B delivers actionable feedback to off-the-shelf multi-agent systems like MetaGPT and MaAS with 4.8-14.2% performance gains, empowering self-correcting and self-evolving agentic AI.
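
To make the attribution task concrete, the following is a minimal sketch of counterfactual replay on a toy trajectory. It is not the AgenTracer implementation: the abstract does not specify the procedure, so the "earliest single-step correction that flips failure to success" rule is an assumption, as are the hypothetical names `Step`, `attribute_failure`, `corrected_action`, and `replay_succeeds`. A real replay would presumably re-execute the downstream steps after a correction; this toy only re-evaluates the patched trace.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass(frozen=True)
class Step:
    agent: str   # which agent acted at this step
    action: str  # what it did (a plain string in this toy example)


def attribute_failure(
    trajectory: List[Step],
    corrected_action: Callable[[int, Step], str],   # hypothetical oracle fix
    replay_succeeds: Callable[[List[Step]], bool],  # hypothetical success check
) -> Optional[Tuple[str, int]]:
    """Return (agent, step index) of the earliest step whose correction
    turns the failed run into a success, or None if no single fix does."""
    for i, step in enumerate(trajectory):
        patched = list(trajectory)  # copy so the original trace is untouched
        patched[i] = Step(step.agent, corrected_action(i, step))
        if replay_succeeds(patched):
            return step.agent, i
    return None


# Toy failed trajectory: the coder makes an arithmetic mistake.
failed = [
    Step("planner", "plan: compute the sum"),
    Step("coder", "compute 2 + 2 = 5"),
    Step("reviewer", "approve"),
]
fix = lambda i, s: s.action.replace("= 5", "= 4")            # assumed oracle
ok = lambda traj: all("= 5" not in s.action for s in traj)   # assumed checker

print(attribute_failure(failed, fix, ok))  # ('coder', 1)
```

The output names both the responsible agent and the step index, matching the two granularities (agent and step) that the attribution task asks for.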
