Abstract
Graphical user interfaces (GUIs) are the primary medium for human-computer interaction, yet automating GUI interactions remains challenging due to the complexity of visual elements, dynamic environments, and the need for multi-step reasoning. Existing methods based on vision-language models (VLMs) often suffer from limited resolution, domain mismatch, and insufficient sequential decision-making capability. To address these issues, we propose Mano, a robust GUI agent built upon a multi-modal foundation model pre-trained on extensive web and computer system data. Our approach integrates a novel simulated environment for high-fidelity data generation, a three-stage training pipeline (supervised fine-tuning, offline reinforcement learning, and online reinforcement learning), and a verification module for error recovery. Mano demonstrates state-of-the-art performance on multiple GUI benchmarks, including Mind2Web and OSWorld, achieving significant improvements in success rate and operational accuracy. Our work provides new insights into the effective integration of reinforcement learning with VLMs for practical GUI agent deployment, highlighting the importance of domain-specific data, iterative training, and holistic reward design.