VGR: Visual Grounded Reasoning

Abstract

In the field of multimodal chain-of-thought (CoT) reasoning, existing approaches predominantly reason in pure language space, which inherently suffers from language bias and is largely confined to math or science domains. This narrow focus limits their ability to handle complex visual reasoning tasks that demand comprehensive understanding of image details. To address these limitations, this paper introduces VGR, a novel reasoning multimodal large language model (MLLM) with enhanced fine-grained visual perception capabilities. Unlike traditional MLLMs that answer questions or reason solely in the language space, VGR first detects relevant regions that may help to solve the problem, and then provides precise answers based on the replayed image regions. To achieve this, we construct a large-scale SFT dataset called VGR-SFT that contains reasoning data mixing vision grounding and language deduction. The inference pipeline of VGR allows the model to choose bounding boxes for visual reference, and a replay stage integrates the corresponding regions into the reasoning process, enhancing multimodal comprehension. Experiments on the LLaVA-NeXT-7B baseline show that VGR achieves superior performance on multimodal benchmarks requiring comprehensive image detail understanding. Compared to the baseline, VGR uses only 30% of the image token count while improving scores by +4.1 on MMStar, +7.1 on AI2D, and +12.9 on ChartQA.
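
As a rough illustration of the select-then-replay idea described in the abstract, the following is a minimal sketch and not the authors' implementation. It assumes a hypothetical `<box>x1,y1,x2,y2</box>` tag for the regions the model selects (the abstract does not specify the actual syntax), represents the image as a plain row-major grid, and interleaves the cropped regions with the reasoning text after the fact, whereas the abstract describes replay as part of the reasoning process itself. The names `BOX_PATTERN`, `crop`, and `replay_regions` are illustrative only.

```python
import re

# Hypothetical tag format for a selected region; an assumption for this sketch.
BOX_PATTERN = re.compile(
    r"<box>\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*</box>"
)


def crop(image, box):
    """Return the part of a row-major image inside box = (x1, y1, x2, y2).

    Coordinates are end-exclusive and clamped to the image bounds.
    """
    x1, y1, x2, y2 = box
    height = len(image)
    width = len(image[0]) if height else 0
    x1, x2 = max(0, x1), min(width, x2)
    y1, y2 = max(0, y1), min(height, y2)
    return [row[x1:x2] for row in image[y1:y2]]


def replay_regions(image, reasoning_text):
    """Split reasoning text at box tags and interleave the cropped regions.

    Returns a list of ("text", str) and ("region", grid) items in order.
    """
    sequence = []
    cursor = 0
    for match in BOX_PATTERN.finditer(reasoning_text):
        text = reasoning_text[cursor:match.start()].strip()
        if text:
            sequence.append(("text", text))
        box = tuple(int(group) for group in match.groups())
        sequence.append(("region", crop(image, box)))
        cursor = match.end()
    tail = reasoning_text[cursor:].strip()
    if tail:
        sequence.append(("text", tail))
    return sequence


# Toy 4x6 "image" whose pixel value encodes its row and column.
image = [[r * 10 + c for c in range(6)] for r in range(4)]
steps = replay_regions(
    image, "The legend is at <box>1,1,3,3</box> so the answer follows."
)
```

In a real system the cropped region would be encoded into visual tokens and fed back to the model before generation continues; the sketch only shows the bookkeeping of locating boxes and placing the corresponding regions in sequence.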
