Can We Rely on LLM Agents to Draft Long-Horizon Plans? Let's Take TravelPlanner as an Example

  • 2024-08-12 18:39:01
  • Yanan Chen, Ali Pesaranghader, Tanmana Sadhu, Dong Hoon Yi

Abstract

Large language models (LLMs) have brought autonomous agents closer to artificial general intelligence (AGI) due to their promising generalization and emergent capabilities. There is, however, a lack of studies on how LLM-based agents behave, why they could potentially fail, and how to improve them, particularly in demanding real-world planning tasks. In this paper, as an effort to fill the gap, we present our study using a realistic benchmark, TravelPlanner, where an agent must meet multiple constraints to generate accurate plans. We leverage this benchmark to address four key research questions: (1) are LLM agents robust enough to lengthy and noisy contexts when it comes to reasoning and planning? (2) can few-shot prompting adversely impact the performance of LLM agents in scenarios with long context? (3) can we rely on refinement to improve plans, and (4) can fine-tuning LLMs with both positive and negative feedback lead to further improvement? Our comprehensive experiments indicate that, firstly, LLMs often fail to attend to crucial parts of a long context, despite their ability to handle extensive reference information and few-shot examples; secondly, they still struggle with analyzing the long plans and cannot provide accurate feedback for refinement; thirdly, we propose Feedback-Aware Fine-Tuning (FAFT), which leverages both positive and negative feedback, resulting in substantial gains over Supervised Fine-Tuning (SFT). Our findings offer in-depth insights to the community on various aspects related to real-world planning applications.