Multimodal Reference Visual Grounding

  • 2025-09-24 17:23:48
  • Yangxiao Lu, Ruosen Li, Liqiang Jing, Jikai Wang, Xinya Du, Yunhui Guo, Nicholas Ruozzi, Yu Xiang

Abstract

Visual grounding focuses on detecting objects from images based on language expressions. Recent Large Vision-Language Models (LVLMs) have significantly advanced visual grounding performance by training large models with large-scale datasets. However, the problem remains challenging, especially when similar objects appear in the input image. For example, an LVLM may not be able to differentiate Diet Coke and regular Coke in an image. In this case, if additional reference images of Diet Coke and regular Coke are available, it can help the visual grounding of similar objects. In this work, we introduce a new task named Multimodal Reference Visual Grounding (MRVG). In this task, a model has access to a set of reference images of objects in a database. Based on these reference images and a language expression, the model is required to detect a target object from a query image. We first introduce a new dataset to study the MRVG problem. Then we introduce a novel method, named MRVG-Net, to solve this visual grounding problem. We show that by efficiently using reference images with few-shot object detection and using Large Language Models (LLMs) for object matching, our method achieves superior visual grounding performance compared to the state-of-the-art LVLMs such as Qwen2.5-VL-72B. Our approach bridges the gap between few-shot detection and visual grounding, unlocking new capabilities for visual understanding, which has wide applications in robotics. Project page with our video, code, and dataset: https://irvlutd.github.io/MultiGrounding

 
