Grounding Computer Use Agents on Human Demonstrations

Abstract

Building reliable computer-use agents requires grounding: accurately connecting natural language instructions to the correct on-screen elements. While large datasets exist for web and mobile interactions, high-quality resources for desktop environments are limited. To address this gap, we introduce GroundCUA, a large-scale desktop grounding dataset built from expert human demonstrations. It covers 87 applications across 12 categories and includes 56K screenshots, with every on-screen element carefully annotated for a total of over 3.56M human-verified annotations. From these demonstrations, we generate diverse instructions that capture a wide range of real-world tasks, providing high-quality data for model training. Using GroundCUA, we develop the GroundNext family of models that map instructions to their target UI elements. At both 3B and 7B scales, GroundNext achieves state-of-the-art results across five benchmarks using supervised fine-tuning, while requiring less than one-tenth the training data of prior work. Reinforcement learning post-training further improves performance, and when evaluated in an agentic setting on the OSWorld benchmark with o3 as the planner, GroundNext attains comparable or superior results to models trained with substantially more data. These results demonstrate the critical role of high-quality, expert-driven datasets in advancing general-purpose computer-use agents.
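
To make the grounding task concrete, the following is a minimal sketch of one common way to score it: a prediction counts as correct when the predicted click point falls inside the annotated bounding box of the target element. The abstract does not specify the metric or data format used for GroundNext, so the `BBox` type, the `grounding_accuracy` function, and the pixel-coordinate convention here are illustrative assumptions rather than the authors' evaluation code.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class BBox:
    """Hypothetical annotation: an axis-aligned box in screenshot pixels."""
    left: float
    top: float
    right: float
    bottom: float

    def contains(self, x: float, y: float) -> bool:
        # Inclusive bounds: a click on the border still counts as a hit.
        return self.left <= x <= self.right and self.top <= y <= self.bottom


def grounding_accuracy(
    predictions: Sequence[Tuple[float, float]],
    targets: Sequence[BBox],
) -> float:
    """Fraction of predicted (x, y) points that land inside their target box.

    Assumes predictions[i] corresponds to targets[i].
    """
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not targets:
        return 0.0
    hits = sum(box.contains(x, y) for (x, y), box in zip(predictions, targets))
    return hits / len(targets)
```

For example, given two instructions whose target boxes are `BBox(10, 10, 20, 20)` and `BBox(0, 0, 50, 50)`, predicted points `(15, 15)` and `(100, 100)` yield one hit out of two.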
