Abstract
Evaluating the output of Large Language Models (LLMs) is one of the most critical aspects of building a performant compound AI system. Since the output from LLMs propagates to downstream steps, identifying LLM errors is crucial to system performance. A common task for LLMs in AI systems is tool use. While there are several benchmark environments for evaluating LLMs on this task, they typically report only a success rate without any explanation of the failure cases. To solve this problem, we introduce SpecTool, a new benchmark to identify error patterns in LLM output on tool-use tasks. Our benchmark data set comprises queries from diverse environments that can be used to test for the presence of seven newly characterized error patterns. Using SpecTool, we show that even the most prominent LLMs exhibit these error patterns in their outputs. Researchers can use the analysis and insights from SpecTool to guide their error mitigation strategies.