VIKSER: Visual Knowledge-Driven Self-Reinforcing Reasoning Framework

  • 2025-09-02 05:28:29
  • Chao Wang, Chunbai Zhang, Yongxiao Tian, Yang Zhou, Yan Peng

Abstract

Visual reasoning refers to the task of solving questions about visual information. Current visual reasoning methods typically employ pre-trained vision-language model (VLM) strategies or deep neural network approaches. However, existing efforts are constrained by limited reasoning interpretability and hindered by underspecification in the question text. Additionally, the absence of fine-grained visual knowledge limits the precise understanding of subject behavior in visual reasoning tasks. To address these issues, we propose VIKSER (Visual Knowledge-Driven Self-Reinforcing Reasoning Framework). Specifically, VIKSER, trained using knowledge distilled from large language models, extracts fine-grained visual knowledge with the assistance of visual relationship detection techniques. Subsequently, VIKSER uses this fine-grained visual knowledge to paraphrase underspecified questions. Additionally, we design a novel prompting method called Chain-of-Evidence (CoE), which leverages the power of "evidence for reasoning" to endow VIKSER with interpretable reasoning capabilities. Meanwhile, the integration of self-reflection technology empowers VIKSER with the ability to learn and improve from its mistakes. Experiments conducted on widely used datasets demonstrate that VIKSER achieves new state-of-the-art (SOTA) results in relevant tasks. Moreover, VIKSER achieves performance on par with leading proprietary models, such as the latest ChatGPT-5.

 
