Abstract
The emergence of LLM-based agents represents a paradigm shift in AI, enabling autonomous systems to plan, reason, use tools, and maintain memory while interacting with dynamic environments. This paper provides the first comprehensive survey of evaluation methodologies for these increasingly capable agents. We systematically analyze evaluation benchmarks and frameworks across four critical dimensions: (1) fundamental agent capabilities, including planning, tool use, self-reflection, and memory; (2) application-specific benchmarks for web, software engineering, scientific, and conversational agents; (3) benchmarks for generalist agents; and (4) frameworks for evaluating agents. Our analysis reveals emerging trends, including a shift toward more realistic, challenging evaluations with continuously updated benchmarks. We also identify critical gaps that future research must address, particularly in assessing cost-efficiency, safety, and robustness, and in developing fine-grained and scalable evaluation methods. This survey maps the rapidly evolving landscape of agent evaluation, highlights emerging trends, identifies current limitations, and proposes directions for future research.