Abstract
Graphical User Interface (GUI) action grounding is a critical step in GUI automation that maps language instructions to actionable elements on GUI screens. Most recent work on GUI action grounding leverages large GUI datasets to fine-tune multimodal large language models (MLLMs). However, the fine-tuning data always covers only a limited set of GUI environments, and we find that the performance of the resulting model deteriorates in novel environments. We argue that GUI grounding models should be further aligned to novel environments to reveal their full potential when inference is known to involve novel environments, i.e., environments not used during the previous fine-tuning. To realize this, we first propose GUI-Bee, an MLLM-based autonomous agent, to collect high-quality, environment-specific data through exploration and then continuously fine-tune GUI grounding models with the collected data. Our agent leverages a novel Q-value-Incentive In-Context Reinforcement Learning (Q-ICRL) method to optimize exploration efficiency and data quality. Additionally, we introduce NovelScreenSpot, a benchmark for testing how well the data can help align GUI action grounding models to novel environments, and demonstrate the effectiveness of data collected by GUI-Bee in the experiments. Furthermore, we conduct an ablation study to validate the Q-ICRL method in enhancing the efficiency of GUI-Bee. Project page: https://gui-bee.github.io