UI-Venus Technical Report: Building High-performance UI Agents with RFT

Abstract

We present UI-Venus, a native UI agent built on a multimodal large language model that takes only screenshots as input. UI-Venus achieves SOTA performance on both UI grounding and navigation tasks using only several hundred thousand high-quality training samples through reinforcement fine-tuning (RFT) based on Qwen2.5-VL. Specifically, the 7B and 72B variants of UI-Venus obtain 94.1% / 50.8% and 95.3% / 61.9% on the standard grounding benchmarks, i.e., Screenspot-V2 / Pro, surpassing the previous SOTA baselines, including the open-source GTA1 and the closed-source UI-TARS-1.5. To demonstrate UI-Venus's summarization and planning abilities, we also evaluate it on AndroidWorld, an online UI navigation arena, on which our 7B and 72B variants achieve success rates of 49.1% and 65.9%, also beating existing models. To achieve this, we introduce carefully designed reward functions for both UI grounding and navigation tasks, along with corresponding efficient data cleaning strategies. To further boost navigation performance, we propose Self-Evolving Trajectory History Alignment & Sparse Action Enhancement, which refine historical reasoning traces and balance the distribution of sparse but critical actions, leading to more coherent planning and better generalization in complex UI tasks. Our contributions include the release of SOTA open-source UI agents, comprehensive data cleaning protocols, and a novel self-evolving framework for improving navigation performance, which encourage further research and development in the community. Code is available at https://github.com/inclusionAI/UI-Venus.
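
For readers unfamiliar with how grounding benchmarks of this kind are usually scored, the following is a minimal sketch and not the authors' implementation. It assumes the common convention that a predicted click point counts as correct when it falls inside the ground-truth bounding box of the target element. The names `point_in_box` and `grounding_accuracy` are hypothetical, and the abstract does not specify the actual reward functions used in UI-Venus; a binary hit signal of this shape is only one plausible ingredient of a grounding reward.

```python
from typing import Sequence, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def point_in_box(point: Point, box: Box) -> bool:
    """Return True if the predicted click point lies inside the box (edges inclusive)."""
    x, y = point
    x_min, y_min, x_max, y_max = box
    return x_min <= x <= x_max and y_min <= y <= y_max


def grounding_accuracy(preds: Sequence[Point], boxes: Sequence[Box]) -> float:
    """Fraction of predictions that hit their ground-truth box.

    Assumption: one predicted point per sample, paired by position with its box.
    """
    if len(preds) != len(boxes):
        raise ValueError("preds and boxes must have the same length")
    if not preds:
        return 0.0
    hits = sum(point_in_box(p, b) for p, b in zip(preds, boxes))
    return hits / len(preds)
```

Under this convention, a reported score such as 94.1% would mean that 94.1% of predicted points landed inside their target boxes.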
