NESTFUL: A Benchmark for Evaluating LLMs on Nested Sequences of API Calls

Abstract

The resurgence of autonomous agents built using large language models (LLMs) to solve complex real-world tasks has brought increased focus on LLMs' fundamental ability of tool or function calling. At the core of these agents, an LLM must plan, execute, and respond using external tools, APIs, and custom functions. Research on tool calling has gathered momentum, but evaluation benchmarks and datasets representing the complexity of the tasks have lagged behind. In this work, we focus on one such complexity, nested sequencing, with the goal of extending existing benchmarks and evaluation. Specifically, we present NESTFUL, a benchmark to evaluate LLMs on nested sequences of API calls, i.e., sequences where the output of one API call is passed as input to a subsequent call. NESTFUL contains 1800+ nested sequences where all the function calls are executable. Experimental results on multiple models and settings show that the best-performing model on the dataset has a full sequence match accuracy of 25% and a win-rate of 34%, leaving considerable room for improvement in the nested sequencing aspect of function calling. Our analysis of these results provides possible future research directions for the community, in addition to a benchmark to track progress. We have released the NESTFUL dataset under the Apache 2.0 license at https://github.com/IBM/NESTFUL.
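To make the notion of a nested sequence concrete, the following is a minimal sketch in Python of executing a sequence in which a later call consumes the output of an earlier one. The toy tools (`add`, `multiply`), the `label` field, and the `"$label"` reference syntax are assumptions chosen for illustration; they are not taken from the NESTFUL dataset format.

```python
from typing import Any, Callable, Dict, List


# Hypothetical toy tools for illustration; not part of NESTFUL.
def add(a: float, b: float) -> float:
    return a + b


def multiply(a: float, b: float) -> float:
    return a * b


TOOLS: Dict[str, Callable[..., Any]] = {"add": add, "multiply": multiply}


def run_sequence(calls: List[Dict[str, Any]]) -> Any:
    """Execute calls in order and return the output of the last one.

    A string argument of the form "$label" (an assumed convention) is
    replaced by the output of the earlier call that carries that label.
    """
    outputs: Dict[str, Any] = {}
    result: Any = None
    for call in calls:
        resolved = {}
        for key, value in call["arguments"].items():
            if isinstance(value, str) and value.startswith("$"):
                # Nested step: look up the output of a previous call.
                value = outputs[value[1:]]
            resolved[key] = value
        result = TOOLS[call["name"]](**resolved)
        outputs[call["label"]] = result
    return result


# (2 + 3) * 4: the second call takes the first call's output as input.
sequence = [
    {"name": "add", "label": "var_1", "arguments": {"a": 2, "b": 3}},
    {"name": "multiply", "label": "var_2", "arguments": {"a": "$var_1", "b": 4}},
]
final_answer = run_sequence(sequence)
```

Here the second call cannot be issued correctly unless the model both orders the calls properly and wires `var_1` into the right argument, which is the difficulty nested sequencing adds over independent calls.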
