Abstract
Modern web agents possess computer use abilities that allow them to interact with webpages by sending commands to a virtual keyboard and mouse. While such agents have considerable potential to assist human users with complex tasks, evaluating their capabilities in real-world settings poses a major challenge. To this end, we introduce BEARCUBS, a "small but mighty" benchmark of 111 information-seeking questions designed to evaluate a web agent's ability to search, browse, and identify factual information from the web. Unlike prior web agent benchmarks, solving BEARCUBS requires (1) accessing live web content rather than synthetic or simulated pages, which captures the unpredictability of real-world web interactions; and (2) performing a broad range of multimodal interactions (e.g., video understanding, 3D navigation) that cannot be bypassed via text-based workarounds. Each question in BEARCUBS has a corresponding short, unambiguous answer and a human-validated browsing trajectory, allowing for transparent evaluation of agent performance and strategies. A human study confirms that BEARCUBS questions are solvable but non-trivial (84.7% human accuracy), revealing domain knowledge gaps and overlooked details as common failure points. We find that ChatGPT Agent significantly outperforms other computer-using agents with an overall accuracy of 65.8% (compared to, e.g., Operator's 23.4%), showcasing substantial progress in tasks involving real computer use, such as playing web games and navigating 3D environments. Nevertheless, closing the gap to human performance requires improvements in areas like fine control, complex data filtering, and execution speed. To facilitate future research, BEARCUBS will be updated periodically to replace invalid or contaminated questions, keeping the benchmark fresh for future generations of web agents.