FLORA: Formal Language Model Enables Robust Training-free Zero-shot Object Referring Analysis

Abstract

Object Referring Analysis (ORA), commonly known as referring expression comprehension, requires the identification and localization of specific objects in an image based on natural-language descriptions. Unlike generic object detection, ORA requires both accurate language understanding and precise visual localization, making it inherently more complex. Although recent pre-trained large visual grounding detectors have achieved significant progress, they rely heavily on extensively labeled data and time-consuming learning. To address these limitations, we introduce a novel, training-free framework for zero-shot ORA, termed FLORA (Formal Language for Object Referring and Analysis). FLORA harnesses the inherent reasoning capabilities of large language models (LLMs) and integrates a formal language model, a logical framework that regulates language within structured, rule-based descriptions, to provide effective zero-shot ORA. More specifically, our formal language model (FLM) enables an effective, logic-driven interpretation of object descriptions without requiring any training. Built upon FLM-regulated LLM outputs, we further devise a Bayesian inference framework and employ appropriate off-the-shelf interpretive models to finalize the reasoning, delivering favorable robustness against LLM hallucinations and compelling ORA performance in a training-free manner. In practice, FLORA boosts the zero-shot performance of existing pre-trained grounding detectors by up to around 45%. Our comprehensive evaluation across different challenging datasets also confirms that FLORA consistently surpasses current state-of-the-art zero-shot methods in both detection and segmentation tasks associated with zero-shot ORA. We believe our probabilistic parsing and reasoning of the LLM outputs elevate the reliability and interpretability of zero-shot ORA. We will release the code upon publication.
