UI-AGILE: Advancing GUI Agents with Effective Reinforcement Learning and Precise Inference-Time Grounding

Abstract

The emergence of Multimodal Large Language Models (MLLMs) has driven significant advances in Graphical User Interface (GUI) agent capabilities. Nevertheless, existing GUI agent training and inference techniques still suffer from three problems: a dilemma in reasoning design, ineffective rewards, and visual noise. To address these issues, we introduce UI-AGILE, a comprehensive framework that enhances GUI agents at both the training and inference stages. For training, we propose a suite of improvements to the Reinforcement Fine-Tuning (RFT) process: 1) a Continuous Reward function to incentivize high-precision grounding; 2) a "Simple Thinking" reward to balance planning with speed and grounding accuracy; and 3) a Cropping-based Resampling strategy to mitigate the sparse reward problem and improve learning on complex tasks. For inference, we present Decomposed Grounding with Selection, a novel method that dramatically improves grounding accuracy on high-resolution displays by breaking the image into smaller, manageable parts. Experiments show that UI-AGILE achieves state-of-the-art performance on two benchmarks, ScreenSpot-Pro and ScreenSpot-v2. For instance, using both our proposed training and inference enhancement methods yields a 23% improvement in grounding accuracy over the best baseline on ScreenSpot-Pro.
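
To make two of these ideas concrete, the following is a minimal illustrative sketch, not the formulation used in UI-AGILE. It assumes a hypothetical reward that is 0 outside the target box and falls linearly from 1.0 at the box center to 0.5 at its edge, and a hypothetical overlapping-tile split of a screenshot with a helper that maps a point predicted inside a tile back to full-image coordinates. All function names, the reward shape, and the tile and overlap parameters are assumptions made for illustration; the selection step that chooses among per-tile candidates is omitted.

```python
def continuous_grounding_reward(pred, box):
    """Hypothetical continuous reward for a predicted click point.

    pred is (x, y); box is (left, top, right, bottom).
    Returns 0.0 outside the box, 1.0 at the center, 0.5 on the box edge.
    """
    x, y = pred
    left, top, right, bottom = box
    if not (left <= x <= right and top <= y <= bottom):
        return 0.0
    cx, cy = (left + right) / 2, (top + bottom) / 2
    half_w, half_h = (right - left) / 2, (bottom - top) / 2
    # Normalized per-axis offset from the center, in [0, 1].
    dx = abs(x - cx) / half_w if half_w > 0 else 0.0
    dy = abs(y - cy) / half_h if half_h > 0 else 0.0
    # The larger of the two offsets decides how far "off-center" the point is.
    return 1.0 - 0.5 * max(dx, dy)


def split_into_tiles(width, height, tile, overlap):
    """Hypothetical split of an image into overlapping square tiles.

    Returns a list of (left, top, right, bottom) boxes covering the image.
    Requires tile > overlap so that the stride is positive.
    """
    step = tile - overlap
    tiles = []
    for top in range(0, max(height - overlap, 1), step):
        for left in range(0, max(width - overlap, 1), step):
            tiles.append((left, top,
                          min(left + tile, width),
                          min(top + tile, height)))
    return tiles


def to_global(tile_box, local_point):
    """Map a point predicted inside a tile back to full-image coordinates."""
    return (tile_box[0] + local_point[0], tile_box[1] + local_point[1])
```

Unlike a binary hit-or-miss reward, a graded reward of this kind distinguishes a click near the center of the target from one that barely lands inside it, which is the kind of signal a reward aimed at high-precision grounding needs.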
