BALROG: Benchmarking Agentic LLM and VLM Reasoning On Games

  • 2024-11-20 18:54:32
  • Davide Paglieri, Bartłomiej Cupiał, Samuel Coward, Ulyana Piterbarg, Maciej Wolczyk, Akbir Khan, Eduardo Pignatelli, Łukasz Kuciński, Lerrel Pinto, Rob Fergus, Jakob Nicolaus Foerster, Jack Parker-Holder, Tim Rocktäschel

Abstract

Large Language Models (LLMs) and Vision Language Models (VLMs) possess extensive knowledge and exhibit promising reasoning abilities; however, they still struggle to perform well in complex, dynamic environments. Real-world tasks require handling intricate interactions, advanced spatial reasoning, long-term planning, and continuous exploration of new strategies, areas in which we lack effective methodologies for comprehensively evaluating these capabilities. To address this gap, we introduce BALROG, a novel benchmark designed to assess the agentic capabilities of LLMs and VLMs through a diverse set of challenging games. Our benchmark incorporates a range of existing reinforcement learning environments with varying levels of difficulty, including tasks that are solvable by non-expert humans in seconds to extremely challenging ones that may take years to master (e.g., the NetHack Learning Environment). We devise fine-grained metrics to measure performance and conduct an extensive evaluation of several popular open-source and closed-source LLMs and VLMs. Our findings indicate that while current models achieve partial success in the easier games, they struggle significantly with more challenging tasks. Notably, we observe severe deficiencies in vision-based decision-making, as models perform worse when visual representations of the environments are provided. We release BALROG as an open and user-friendly benchmark to facilitate future research and development in the agentic community.

 
