Abstract
Vision-language models have demonstrated impressive capabilities as computer-use agents (CUAs) that automate diverse computer tasks. As their commercial potential grows, critical details of the most capable CUA systems remain closed. Because these agents will increasingly mediate digital interactions and execute consequential decisions on our behalf, the research community needs access to open CUA frameworks to study their capabilities, limitations, and risks. To bridge this gap, we propose OpenCUA, a comprehensive open-source framework for scaling CUA data and foundation models. Our framework consists of: (1) an annotation infrastructure that seamlessly captures human computer-use demonstrations; (2) AgentNet, the first large-scale computer-use task dataset spanning 3 operating systems and 200+ applications and websites; and (3) a scalable pipeline that transforms demonstrations into state-action pairs with reflective long Chain-of-Thought reasoning, which sustains robust performance gains as data scales. Our end-to-end agent models demonstrate strong performance across CUA benchmarks. In particular, OpenCUA-32B achieves an average success rate of 34.8% on OSWorld-Verified, establishing a new state-of-the-art (SOTA) among open-source models and surpassing OpenAI CUA (GPT-4o). Further analysis confirms that our approach generalizes well across domains and benefits significantly from increased test-time computation. We release our annotation tool, datasets, code, and models to build open foundations for further CUA research.