Collab-Overcooked: Benchmarking and Evaluating Large Language Models as Collaborative Agents

Abstract

Large Language Model (LLM)-based agent systems have made great strides in real-world applications beyond traditional NLP tasks. This paper proposes a new LLM-based Multi-Agent System (LLM-MAS) benchmark, Collab-Overcooked, built on the popular Overcooked-AI game with more applicable and challenging tasks in interactive environments. Collab-Overcooked extends existing benchmarks in two novel ways. First, it provides a multi-agent framework supporting diverse tasks and objectives and encourages collaboration through natural language communication. Second, it introduces a spectrum of process-oriented evaluation metrics to assess the fine-grained collaboration capabilities of different LLM agents, a dimension often overlooked in prior work. We conduct extensive experiments with 13 popular LLMs and show that, while the LLMs exhibit a strong ability in goal interpretation, there are significant shortcomings in active collaboration and continuous adaptation, which are critical for efficiently fulfilling complex tasks. Notably, we highlight the strengths and weaknesses of LLM-MAS and provide insights for improving and evaluating LLM-MAS on a unified and open-source benchmark. The environments, 30 open-ended tasks, and the evaluation package are publicly available at https://github.com/YusaeMeow/Collab-Overcooked.
