Abstract
Repurposing large vision-language models (LVLMs) as computer use agents (CUAs) has led to substantial breakthroughs, primarily driven by human-labeled data. However, these models often struggle with novel and specialized software, particularly in scenarios lacking human annotations. To address this challenge, we propose SEAgent, an agentic self-evolving framework enabling CUAs to autonomously evolve through interactions with unfamiliar software. Specifically, SEAgent empowers computer-use agents to autonomously master novel software environments via experiential learning, where agents explore new software, learn through iterative trial-and-error, and progressively tackle auto-generated tasks organized from simple to complex. To achieve this goal, we design a World State Model for step-wise trajectory assessment, along with a Curriculum Generator that generates increasingly diverse and challenging tasks. The agent's policy is updated through experiential learning, comprising adversarial imitation of failure actions and Group Relative Policy Optimization (GRPO) on successful ones. Furthermore, we introduce a specialist-to-generalist training strategy that integrates individual experiential insights from specialist agents, facilitating the development of a stronger generalist CUA capable of continuous autonomous evolution. This unified agent ultimately achieves performance surpassing ensembles of individual specialist agents on their specialized software. We validate the effectiveness of SEAgent across five novel software environments within OS-World. Our approach achieves a significant improvement of 23.2 percentage points in success rate, from 11.3% to 34.5%, over a competitive open-source CUA, i.e., UI-TARS.
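
To make the GRPO component concrete, the following minimal sketch shows the group-relative advantage computation that GRPO is generally built on: each sampled action's reward is normalized against the mean and standard deviation of its own sampling group. It is an illustration only, not the SEAgent implementation. The function name, the `eps` constant, the use of the population standard deviation, and the example rewards are all assumptions; the abstract does not specify how rewards are derived from the World State Model.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own sampling group (GRPO-style).

    `rewards` holds one scalar reward per sampled action in the group.
    `eps` (an assumed constant) avoids division by zero when all rewards match.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; implementations may use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four actions sampled at the same step.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Actions that score above the group average receive a positive advantage and those below receive a negative one, so no separate value network is needed to form a baseline.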