GUI-R1: A Generalist R1-Style Vision-Language Action Model For GUI Agents

Abstract

Existing efforts in building Graphical User Interface (GUI) agents largely rely on the training paradigm of supervised fine-tuning on Large Vision-Language Models (LVLMs). However, this approach not only demands extensive amounts of training data but also struggles to effectively understand GUI screenshots and generalize to unseen interfaces. The issue significantly limits its application in real-world scenarios, especially for high-level tasks. Inspired by Reinforcement Fine-Tuning (RFT) in large reasoning models (e.g., DeepSeek-R1), which efficiently enhances the problem-solving capabilities of large language models in real-world settings, we propose GUI-R1, the first reinforcement learning framework designed to enhance the GUI capabilities of LVLMs in high-level real-world task scenarios, through unified action space rule modeling. By leveraging a small amount of carefully curated high-quality data across multiple platforms (including Windows, Linux, MacOS, Android, and Web) and employing policy optimization algorithms such as Group Relative Policy Optimization (GRPO) to update the model, GUI-R1 achieves superior performance using only 0.02% of the data (3K vs. 13M) compared to previous state-of-the-art methods like OS-Atlas across eight benchmarks spanning three different platforms (mobile, desktop, and web). These results demonstrate the immense potential of reinforcement learning based on unified action space rule modeling in improving the execution capabilities of LVLMs for real-world GUI agent tasks.
