SpecTool: A Benchmark for Characterizing Errors in Tool-Use LLMs

Abstract

Evaluating the output of Large Language Models (LLMs) is one of the most critical aspects of building a performant compound AI system. Since the output from LLMs propagates to downstream steps, identifying LLM errors is crucial to system performance. A common task for LLMs in AI systems is tool use. While there are several benchmark environments for evaluating LLMs on this task, they typically give only a success rate without any explanation of the failure cases. To solve this problem, we introduce SpecTool, a new benchmark to identify error patterns in LLM output on tool-use tasks. Our benchmark data set comprises queries from diverse environments that can be used to test for the presence of seven newly characterized error patterns. Using SpecTool, we show that even the most prominent LLMs exhibit these error patterns in their outputs. Researchers can use the analysis and insights from SpecTool to guide their error mitigation strategies.
