Abstract
The BrowserGym ecosystem addresses the growing need for efficient evaluation and benchmarking of web agents, particularly those leveraging automation and Large Language Models (LLMs) for web interaction tasks. Many existing benchmarks suffer from fragmentation and inconsistent evaluation methodologies, making it challenging to achieve reliable comparisons and reproducible results. BrowserGym aims to solve this by providing a unified, gym-like environment with well-defined observation and action spaces, facilitating standardized evaluation across diverse benchmarks. Combined with AgentLab, a complementary framework that aids in agent creation, testing, and analysis, BrowserGym offers flexibility for integrating new benchmarks while ensuring consistent evaluation and comprehensive experiment management. This standardized approach seeks to reduce the time and complexity of developing web agents, supporting more reliable comparisons and facilitating in-depth analysis of agent behaviors, and could result in more adaptable, capable agents, ultimately accelerating innovation in LLM-driven automation. As supporting evidence, we conduct the first large-scale, multi-benchmark web agent experiment and compare the performance of 6 state-of-the-art LLMs across all benchmarks currently available in BrowserGym. Among other findings, our results highlight a large discrepancy between OpenAI's and Anthropic's latest models, with Claude-3.5-Sonnet leading the way on almost all benchmarks, except on vision-related tasks where GPT-4o is superior. Despite these advancements, our results emphasize that building robust and efficient web agents remains a significant challenge, due to the inherent complexity of real-world web environments and the limitations of current models.