ZeroGUI: Automating Online GUI Learning at Zero Human Cost

Abstract

The rapid advancement of large Vision-Language Models (VLMs) has propelled the development of pure-vision-based GUI Agents, capable of perceiving and operating Graphical User Interfaces (GUI) to autonomously fulfill user instructions. However, existing approaches usually adopt an offline learning framework, which faces two core limitations: (1) heavy reliance on high-quality manual annotations for element grounding and action supervision, and (2) limited adaptability to dynamic and interactive environments. To address these limitations, we propose ZeroGUI, a scalable, online learning framework for automating GUI Agent training at Zero human cost. Specifically, ZeroGUI integrates (i) VLM-based automatic task generation to produce diverse training goals from the current environment state, (ii) VLM-based automatic reward estimation to assess task success without hand-crafted evaluation functions, and (iii) two-stage online reinforcement learning to continuously interact with and learn from GUI environments. Experiments on two advanced GUI Agents (UI-TARS and Aguvis) demonstrate that ZeroGUI significantly boosts performance across OSWorld and AndroidLab environments. The code is available at https://github.com/OpenGVLab/ZeroGUI.
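
To make the data flow among components (i) to (iii) concrete, the following is a minimal sketch of how one batch of experience might be collected. It is not the authors' implementation. Every name here (`generate_tasks`, `rollout`, `estimate_reward`, `collect_batch`, `Trajectory`) is hypothetical, and the two VLM calls are replaced by toy stand-ins so the example runs on its own. The abstract does not describe the two training stages, so the policy update itself is left out.

```python
from dataclasses import dataclass
from typing import Callable, List

# A policy maps (task, step index) to an action string. Hypothetical interface.
Policy = Callable[[str, int], str]


@dataclass
class Trajectory:
    task: str
    actions: List[str]
    reward: float = 0.0


def generate_tasks(state: str, n: int) -> List[str]:
    """Toy stand-in for (i): a VLM would propose tasks from the current state."""
    return [f"task {i} on {state}" for i in range(n)]


def rollout(policy: Policy, task: str, max_steps: int = 3) -> Trajectory:
    """Run the policy for a fixed number of steps and record its actions."""
    actions = [policy(task, step) for step in range(max_steps)]
    return Trajectory(task=task, actions=actions)


def estimate_reward(traj: Trajectory) -> float:
    """Toy stand-in for (ii): a VLM would judge success from the trajectory.

    Assumption for illustration only: success means the agent emitted "done".
    """
    return 1.0 if "done" in traj.actions else 0.0


def collect_batch(policy: Policy, state: str, n_tasks: int) -> List[Trajectory]:
    """Generate tasks, roll out the policy, and label each trajectory."""
    batch = []
    for task in generate_tasks(state, n_tasks):
        traj = rollout(policy, task)
        traj.reward = estimate_reward(traj)
        batch.append(traj)
    return batch  # would be passed to the RL update in (iii)


if __name__ == "__main__":
    finishing_policy: Policy = lambda task, step: "done" if step == 2 else "click"
    for traj in collect_batch(finishing_policy, "desktop", n_tasks=2):
        print(traj.task, traj.actions, traj.reward)
```

In this sketch no human-written task list or hand-crafted evaluation function appears in the loop. Both are supplied by functions that a VLM would back in a real system, which is the property the abstract emphasizes.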
