GUI-Reflection: Empowering Multimodal GUI Models with Self-Reflection Behavior

  • 2025-06-09 18:59:57
  • Penghao Wu, Shengnan Ma, Bo Wang, Jiaheng Yu, Lewei Lu, Ziwei Liu

Abstract

Multimodal Large Language Models (MLLMs) have shown great potential in revolutionizing Graphical User Interface (GUI) automation. However, existing GUI models mostly rely on learning from nearly error-free offline trajectories, thus lacking reflection and error recovery capabilities. To bridge this gap, we propose GUI-Reflection, a novel framework that explicitly integrates self-reflection and error correction capabilities into end-to-end multimodal GUI models throughout dedicated training stages: GUI-specific pre-training, offline supervised fine-tuning (SFT), and online reflection tuning. GUI-Reflection enables self-reflection behavior emergence with fully automated data generation and learning processes without requiring any human annotation. Specifically, 1) we first propose scalable data pipelines to automatically construct reflection and error correction data from existing successful trajectories. While existing GUI models mainly focus on grounding and UI understanding ability, we propose the GUI-Reflection Task Suite to learn and evaluate reflection-oriented abilities explicitly. 2) Furthermore, we build a diverse and efficient environment for online training and data collection of GUI models on mobile devices. 3) We also present an iterative online reflection tuning algorithm leveraging the proposed environment, enabling the model to continuously enhance its reflection and error correction abilities. Our framework equips GUI agents with self-reflection and correction capabilities, paving the way for more robust, adaptable, and intelligent GUI automation, with all data, models, environments, and tools to be released publicly.

 
