Grasp Any Region: Towards Precise, Contextual Pixel Understanding for Multimodal LLMs

  • 2025-10-22 04:30:24
  • Haochen Wang, Yuhao Wang, Tao Zhang, Yikang Zhou, Yanwei Li, Jiacong Wang, Jiani Zheng, Ye Tian, Jiahao Meng, Zilong Huang, Guangcan Mai, Anran Wang, Yunhai Tong, Zhuochen Wang, Xiangtai Li, Zhaoxiang Zhang

Abstract

While Multimodal Large Language Models (MLLMs) excel at holistic understanding, they struggle in capturing the dense world with complex scenes, requiring fine-grained analysis of intricate details and object inter-relationships. Region-level MLLMs have been a promising step. However, previous attempts are generally optimized to understand given regions in isolation, neglecting crucial global contexts. To address this, we introduce Grasp Any Region (GAR) for comprehensive region-level visual understanding. Empowered by an effective RoI-aligned feature replay technique, GAR supports (1) precise perception by leveraging necessary global contexts, and (2) modeling interactions between multiple prompts. Together, it then naturally achieves (3) advanced compositional reasoning to answer specific free-form questions about any region, shifting the paradigm from passive description to active dialogue. Moreover, we construct GAR-Bench, which not only provides a more accurate evaluation of single-region comprehension, but also, more importantly, measures interactions and complex reasoning across multiple regions. Extensive experiments have demonstrated that GAR-1B not only maintains the state-of-the-art captioning capabilities, e.g., outperforming DAM-3B +4.5 on DLC-Bench, but also excels at modeling relationships between multiple prompts with advanced comprehension capabilities, even surpassing InternVL3-78B on GAR-Bench-VQA. More importantly, our zero-shot GAR-8B even outperforms in-domain VideoRefer-7B on VideoRefer-BenchQ, indicating its strong capabilities can be easily transferred to videos.

 
