Windows Agent Arena: Evaluating Multi-Modal OS Agents at Scale

Abstract

Large language models (LLMs) show remarkable potential to act as computer agents, enhancing human productivity and software accessibility in multi-modal tasks that require planning and reasoning. However, measuring agent performance in realistic environments remains a challenge since: (i) most benchmarks are limited to specific modalities or domains (e.g. text-only, web navigation, Q&A, coding) and (ii) full benchmark evaluations are slow (on order of magnitude of days) given the multi-step sequential nature of tasks. To address these challenges, we introduce the Windows Agent Arena: a reproducible, general environment focusing exclusively on the Windows operating system (OS) where agents can operate freely within a real Windows OS and use the same wide range of applications, tools, and web browsers available to human users when solving tasks. We adapt the OSWorld framework (Xie et al., 2024) to create 150+ diverse Windows tasks across representative domains that require agent abilities in planning, screen understanding, and tool usage. Our benchmark is scalable and can be seamlessly parallelized in Azure for a full benchmark evaluation in as little as 20 minutes. To demonstrate Windows Agent Arena's capabilities, we also introduce a new multi-modal agent, Navi. Our agent achieves a success rate of 19.5% in the Windows domain, compared to 74.5% performance of an unassisted human. Navi also demonstrates strong performance on another popular web-based benchmark, Mind2Web. We offer extensive quantitative and qualitative analysis of Navi's performance, and provide insights into the opportunities for future research in agent development and data generation using Windows Agent Arena. Webpage: https://microsoft.github.io/WindowsAgentArena Code: https://github.com/microsoft/WindowsAgentArena
