Can We Rely on LLM Agents to Draft Long-Horizon Plans? Let's Take TravelPlanner as an Example

Abstract

Large language models (LLMs) have brought autonomous agents closer to artificial general intelligence (AGI) due to their promising generalization and emergent capabilities. There is, however, a lack of studies on how LLM-based agents behave, why they could potentially fail, and how to improve them, particularly in demanding real-world planning tasks. In this paper, as an effort to fill the gap, we present our study using a realistic benchmark, TravelPlanner, where an agent must meet multiple constraints to generate accurate plans. We leverage this benchmark to address four key research questions: (1) are LLM agents robust enough to lengthy and noisy contexts when it comes to reasoning and planning? (2) can few-shot prompting adversely impact the performance of LLM agents in scenarios with long context? (3) can we rely on refinement to improve plans, and (4) can fine-tuning LLMs with both positive and negative feedback lead to further improvement? Our comprehensive experiments indicate that, firstly, LLMs often fail to attend to crucial parts of a long context, despite their ability to handle extensive reference information and few-shot examples; secondly, they still struggle with analyzing the long plans and cannot provide accurate feedback for refinement; thirdly, we propose Feedback-Aware Fine-Tuning (FAFT), which leverages both positive and negative feedback, resulting in substantial gains over Supervised Fine-Tuning (SFT). Our findings offer in-depth insights to the community on various aspects related to real-world planning applications.
