Abstract
Large Language Models (LLMs) and Vision Language Models (VLMs) possess extensive knowledge and exhibit promising reasoning abilities; however, they still struggle to perform well in complex, dynamic environments. Real-world tasks require handling intricate interactions, advanced spatial reasoning, long-term planning, and continuous exploration of new strategies, yet we lack effective methodologies for comprehensively evaluating these capabilities. To address this gap, we introduce BALROG, a novel benchmark designed to assess the agentic capabilities of LLMs and VLMs through a diverse set of challenging games. Our benchmark incorporates a range of existing reinforcement learning environments with varying levels of difficulty, ranging from tasks that non-expert humans can solve in seconds to extremely challenging ones that may take years to master (e.g., the NetHack Learning Environment). We devise fine-grained metrics to measure performance and conduct an extensive evaluation of several popular open-source and closed-source LLMs and VLMs. Our findings indicate that while current models achieve partial success in the easier games, they struggle significantly with more challenging tasks. Notably, we observe severe deficiencies in vision-based decision-making, as models perform worse when visual representations of the environments are provided. We release BALROG as an open and user-friendly benchmark to facilitate future research and development in the agentic community.