OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments

Abstract

Autonomous agents that accomplish complex computer tasks with minimal human interventions have the potential to transform human-computer interaction, significantly enhancing accessibility and productivity. However, existing benchmarks either lack an interactive environment or are limited to environments specific to certain applications or domains, failing to reflect the diverse and complex nature of real-world computer use, thereby limiting the scope of tasks and agent scalability. To address this issue, we introduce OSWorld, the first-of-its-kind scalable, real computer environment for multimodal agents, supporting task setup, execution-based evaluation, and interactive learning across various operating systems such as Ubuntu, Windows, and macOS. OSWorld can serve as a unified, integrated computer environment for assessing open-ended computer tasks that involve arbitrary applications. Building upon OSWorld, we create a benchmark of 369 computer tasks involving real web and desktop apps in open domains, OS file I/O, and workflows spanning multiple applications. Each task example is derived from real-world computer use cases and includes a detailed initial state setup configuration and a custom execution-based evaluation script for reliable, reproducible evaluation. Extensive evaluation of state-of-the-art LLM/VLM-based agents on OSWorld reveals significant deficiencies in their ability to serve as computer assistants. While humans can accomplish over 72.36% of the tasks, the best model achieves only 12.24% success, primarily struggling with GUI grounding and operational knowledge. Comprehensive analysis using OSWorld provides valuable insights for developing multimodal generalist agents that were not possible with previous benchmarks. Our code, environment, baseline models, and data are publicly available at https://os-world.github.io.
