Abstract
Graphical User Interface (GUI) grounding, the task of mapping natural language instructions to precise screen coordinates, is fundamental to autonomous GUI agents. While existing methods achieve strong performance through extensive supervised training or reinforcement learning with labeled rewards, they remain constrained by the cost and availability of pixel-level annotations. We observe that when models generate multiple predictions for the same GUI element, the spatial overlap patterns reveal implicit confidence signals that can guide more accurate localization. Leveraging this insight, we propose GUI-RC (Region Consistency), a test-time scaling method that constructs spatial voting grids from multiple sampled predictions to identify consensus regions where models show highest agreement. Without any training, GUI-RC improves accuracy by 2-3% across various architectures on ScreenSpot benchmarks. We further introduce GUI-RCPO (Region Consistency Policy Optimization), which transforms these consistency patterns into rewards for test-time reinforcement learning. By computing how well each prediction aligns with the collective consensus, GUI-RCPO enables models to iteratively refine their outputs on unlabeled data during inference. Extensive experiments demonstrate the generality of our approach: GUI-RC boosts Qwen2.5-VL-3B-Instruct from 80.11% to 83.57% on ScreenSpot-v2, while GUI-RCPO further improves it to 85.14% through self-supervised optimization. Our approach reveals the untapped potential of test-time scaling and test-time reinforcement learning for GUI grounding, offering a promising path toward more robust and data-efficient GUI agents.
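The voting idea can be illustrated with a minimal sketch. The abstract does not specify the grid resolution, how the consensus region is extracted, or the exact reward formula, so everything below is an assumption made for illustration: boxes are integer pixel rectangles `(x1, y1, x2, y2)` with exclusive right and bottom edges, the consensus region is taken as the bounding box of all cells with the maximum vote count, and the reward is the mean vote density inside a prediction divided by the number of samples. The function names `vote_grid`, `consensus_region`, and `consistency_reward` are hypothetical and not taken from the paper.

```python
from typing import List, Optional, Tuple

# Assumed box format: (x1, y1, x2, y2) in pixels, x2 and y2 exclusive.
Box = Tuple[int, int, int, int]


def vote_grid(boxes: List[Box], width: int, height: int) -> List[List[int]]:
    """Count, for every pixel, how many sampled boxes cover it."""
    grid = [[0] * width for _ in range(height)]
    for x1, y1, x2, y2 in boxes:
        # Clip each box to the screen before voting.
        for y in range(max(0, y1), min(height, y2)):
            row = grid[y]
            for x in range(max(0, x1), min(width, x2)):
                row[x] += 1
    return grid


def consensus_region(grid: List[List[int]]) -> Optional[Box]:
    """Bounding box of the highest-vote cells (an assumed extraction rule)."""
    top = max(max(row) for row in grid)
    if top == 0:
        return None  # no box landed on the screen
    xs = [x for row in grid for x, v in enumerate(row) if v == top]
    ys = [y for y, row in enumerate(grid) if top in row]
    return (min(xs), min(ys), max(xs) + 1, max(ys) + 1)


def consistency_reward(box: Box, grid: List[List[int]], n_samples: int) -> float:
    """Mean vote inside `box`, normalised to [0, 1] by the sample count
    (an assumed reward; the abstract only says rewards reflect alignment
    with the collective consensus)."""
    height, width = len(grid), len(grid[0])
    x1, y1, x2, y2 = box
    votes = [
        grid[y][x]
        for y in range(max(0, y1), min(height, y2))
        for x in range(max(0, x1), min(width, x2))
    ]
    if not votes or n_samples == 0:
        return 0.0
    return sum(votes) / (len(votes) * n_samples)


if __name__ == "__main__":
    # Four hypothetical samples: three roughly agree, one is an outlier.
    samples = [(10, 10, 50, 40), (12, 12, 52, 42), (200, 200, 240, 230), (11, 9, 49, 41)]
    grid = vote_grid(samples, width=300, height=300)
    print("consensus region:", consensus_region(grid))
    for box in samples:
        print(box, "reward:", round(consistency_reward(box, grid, len(samples)), 3))
```

In this toy run the three overlapping samples define the consensus region and receive higher rewards than the outlier, which is the kind of label-free signal that a test-time policy-optimization step could use. A per-pixel grid is the simplest choice here; a coarser grid would trade localization precision for speed.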