AutoLibra: Agent Metric Induction from Open-Ended Feedback

Abstract

Agents are predominantly evaluated and optimized via task success metrics, which are coarse, rely on manual design from experts, and fail to reward intermediate emergent behaviors. We propose AutoLibra, a framework for agent evaluation, that transforms open-ended human feedback, e.g., "If you find that the button is disabled, don't click it again", or "This agent has too much autonomy to decide what to do on its own", into metrics for evaluating fine-grained behaviors in agent trajectories. AutoLibra accomplishes this by grounding feedback to an agent's behavior, clustering similar positive and negative behaviors, and creating concrete metrics with clear definitions and concrete examples, which can be used for prompting LLM-as-a-Judge as evaluators. We further propose two meta-metrics to evaluate the alignment of a set of (induced) metrics with open feedback: "coverage" and "redundancy". Through optimizing these meta-metrics, we experimentally demonstrate AutoLibra's ability to induce more concrete agent evaluation metrics than the ones proposed in previous agent evaluation benchmarks and discover new metrics to analyze agents. We also present two applications of AutoLibra in agent improvement: First, we show that AutoLibra-induced metrics serve as better prompt-engineering targets than the task success rate on a wide range of text game tasks, improving agent performance over baseline by a mean of 20%. Second, we show that AutoLibra can iteratively select high-quality fine-tuning data for web navigation agents. Our results suggest that AutoLibra is a powerful task-agnostic tool for evaluating and improving language agents.
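
The abstract names the two meta-metrics, "coverage" and "redundancy", without defining them. As a rough illustration only, the sketch below assumes one plausible reading: coverage is the fraction of feedback items matched by at least one induced metric, and redundancy is the fraction of induced metrics that match no feedback item. The function names, the dictionary layout, and both definitions are assumptions made for this example and are not taken from the paper.

```python
from typing import Dict, Set


def coverage(feedback_to_metrics: Dict[str, Set[str]]) -> float:
    """Assumed definition: share of feedback items matched by >= 1 metric."""
    if not feedback_to_metrics:
        return 0.0
    matched = sum(1 for metrics in feedback_to_metrics.values() if metrics)
    return matched / len(feedback_to_metrics)


def redundancy(induced_metrics: Set[str],
               feedback_to_metrics: Dict[str, Set[str]]) -> float:
    """Assumed definition: share of induced metrics matching no feedback item."""
    if not induced_metrics:
        return 0.0
    # Collect every metric that is matched to at least one feedback item.
    used = set().union(*feedback_to_metrics.values())
    unused = induced_metrics - used
    return len(unused) / len(induced_metrics)


if __name__ == "__main__":
    # Hypothetical toy data: feedback ids mapped to the metrics that match them.
    matches = {
        "feedback_1": {"avoid_repeated_clicks"},
        "feedback_2": set(),
        "feedback_3": {"avoid_repeated_clicks", "asks_before_acting"},
    }
    metrics = {"avoid_repeated_clicks", "asks_before_acting",
               "concise_output", "polite_tone"}
    print("coverage:", coverage(matches))
    print("redundancy:", redundancy(metrics, matches))
```

Under this reading, a metric set is better aligned with the feedback when coverage is high and redundancy is low, which is consistent with the abstract's description of optimizing both meta-metrics together.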
