LLM-Optic: Unveiling the Capabilities of Large Language Models for Universal Visual Grounding

Abstract

Visual grounding is an essential tool that links user-provided text queries with query-specific regions within an image. Despite advancements in visual grounding models, their ability to comprehend complex queries remains limited. To overcome this limitation, we introduce LLM-Optic, an innovative method that utilizes Large Language Models (LLMs) as an optical lens to enhance existing visual grounding models in comprehending complex text queries involving intricate text structures, multiple objects, or object spatial relationships, situations that current models struggle with. LLM-Optic first employs an LLM as a Text Grounder to interpret complex text queries and accurately identify objects the user intends to locate. Then a pre-trained visual grounding model is used to generate candidate bounding boxes given the query refined by the Text Grounder. After that, LLM-Optic annotates the candidate bounding boxes with numerical marks to establish a connection between text and specific image regions, thereby linking two distinct modalities. Finally, it employs a Large Multimodal Model (LMM) as a Visual Grounder to select the marked candidate objects that best correspond to the original text query. Through LLM-Optic, we have achieved universal visual grounding, which allows for the detection of arbitrary objects specified by arbitrary human language input. Importantly, our method achieves this enhancement without requiring additional training or fine-tuning. Extensive experiments across various challenging benchmarks demonstrate that LLM-Optic achieves state-of-the-art zero-shot visual grounding capabilities. Project Page: https://haoyu-zhao.github.io/LLM-Optic.github.io/.
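
The four stages described in the abstract can be read as a simple control flow. The following is a minimal sketch of that flow, not the authors' implementation: the names `Box`, `assign_marks`, and `ground` are hypothetical, and the three model components (Text Grounder, pre-trained visual grounding model, Visual Grounder) are passed in as plain callables so that no particular LLM, detector, or LMM API is assumed. In the method as described, the numerical marks are drawn onto the image; here the marks are kept as a dictionary and the drawing step is left to the caller.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box in pixel coordinates (hypothetical type)."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float


def assign_marks(boxes: Sequence[Box]) -> Dict[int, Box]:
    """Give each candidate box a numerical mark, starting at 1."""
    return {mark: box for mark, box in enumerate(boxes, start=1)}


def ground(
    image: object,
    query: str,
    text_grounder: Callable[[str], str],
    detector: Callable[[object, str], List[Box]],
    visual_grounder: Callable[[object, Dict[int, Box], str], List[int]],
) -> List[Box]:
    """Run the four-stage flow; all three callables are assumed stand-ins."""
    # Stage 1: an LLM reduces the complex query to the object(s) to locate.
    refined_query = text_grounder(query)

    # Stage 2: a pre-trained grounding model proposes candidate boxes
    # for the refined query.
    candidates = detector(image, refined_query)
    if not candidates:
        return []

    # Stage 3: attach numerical marks so text can refer to image regions.
    marked = assign_marks(candidates)

    # Stage 4: an LMM picks the marks that best match the ORIGINAL query.
    chosen_marks = visual_grounder(image, marked, query)

    # Ignore any mark the model returns that does not exist.
    return [marked[m] for m in chosen_marks if m in marked]
```

Because the components are injected, the same function can be exercised with trivial stand-ins (for example, lambdas returning fixed values) before any real model is wired in. Note that stage 4 receives the original query rather than the refined one, matching the abstract's description of the Visual Grounder.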
