Exploring Autonomous Agents: A Closer Look at Why They Fail When Completing Tasks

Abstract

Autonomous agent systems powered by Large Language Models (LLMs) have demonstrated promising capabilities in automating complex tasks. However, current evaluations largely rely on success rates without systematically analyzing the interactions, communication mechanisms, and failure causes within these systems. To bridge this gap, we present a benchmark of 34 representative programmable tasks designed to rigorously assess autonomous agents. Using this benchmark, we evaluate three popular open-source agent frameworks combined with two LLM backbones, observing a task completion rate of approximately 50%. Through in-depth failure analysis, we develop a three-tier taxonomy of failure causes aligned with task phases, highlighting planning errors, task execution issues, and incorrect response generation. Based on these insights, we propose actionable improvements to enhance agent planning and self-diagnosis capabilities. Our failure taxonomy, together with mitigation advice, provides an empirical foundation for developing more robust and effective autonomous agent systems in the future.
