Capturing Gaze Shifts for Guidance: Cross-Modal Fusion Enhancement for VLM Hallucination Mitigation

Abstract

Vision language models (VLMs) often generate hallucination, i.e., content that cannot be substantiated by either textual or visual inputs. Prior work primarily attributes this to over-reliance on linguistic prior knowledge rather than visual inputs. Some methods attempt to mitigate hallucination by amplifying visual token attention proportionally to their attention scores. However, these methods overlook the visual attention sink problem, where attention is frequently misallocated to task-irrelevant visual regions, and neglect cross-modal fusion balance by enhancing only visual attention without adjusting attention to the user query. This can result in amplifying incorrect areas while failing to properly interpret the user query. To address these challenges, we propose a simple yet effective method called Gaze Shift-Guided Cross-modal Fusion Enhancement (GIFT). GIFT pre-computes a holistic visual saliency map by tracking positive changes in visual attention, or "gaze shifts", during user query comprehension, and leverages this map to amplify attention to both salient visual information and the user query at each decoding step. This reduces the impact of visual attention sink, as irrelevant tokens exhibit minimal shifts, while ensuring balanced cross-modal fusion for well-integrated representation. Extensive experiments show that GIFT effectively mitigates hallucination in VLMs across both generative and classification tasks, achieving up to 20.7% improvement over greedy decoding, while maintaining general vision-language performance with low computational overhead.
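
The gaze-shift idea described in the abstract can be illustrated with a small NumPy sketch. This is not the paper's implementation: the function names (`gaze_shift_saliency`, `amplify_attention`), the multiplicative boost, the renormalization, and the `alpha`/`beta` strengths are all assumptions made for illustration. The sketch only shows why accumulating positive attention changes gives near-zero saliency to a sink token whose attention is high but static, and how a saliency map could be used to boost both salient visual tokens and query tokens.

```python
import numpy as np

def gaze_shift_saliency(attn_steps):
    """Accumulate positive changes in visual attention across query tokens.

    attn_steps: (T, V) array; row t is the attention that query token t
    pays to each of V visual tokens (hypothetical input layout).
    Returns a (V,) saliency map that sums to 1.
    """
    attn_steps = np.asarray(attn_steps, dtype=float)
    deltas = np.diff(attn_steps, axis=0)      # change between consecutive query tokens
    positive = np.clip(deltas, 0.0, None)     # keep only increases ("gaze shifts")
    saliency = positive.sum(axis=0)
    total = saliency.sum()
    if total == 0:                            # no shifts at all: fall back to uniform
        return np.full(attn_steps.shape[1], 1.0 / attn_steps.shape[1])
    return saliency / total

def amplify_attention(attn_row, visual_idx, query_idx, saliency, alpha=1.0, beta=1.0):
    """Boost salient visual tokens and query tokens, then renormalize.

    The scaling rule here is an illustrative assumption, not the paper's formula.
    """
    boosted = np.asarray(attn_row, dtype=float).copy()
    peak = saliency.max()
    if peak > 0:
        boosted[visual_idx] *= 1.0 + alpha * saliency / peak
    boosted[query_idx] *= 1.0 + beta
    return boosted / boosted.sum()

# Visual token 0 acts like an attention sink: high but unchanging attention.
steps = [[0.5, 0.3, 0.2],
         [0.5, 0.1, 0.4],
         [0.5, 0.2, 0.3]]
saliency = gaze_shift_saliency(steps)

# One decoding-step attention row: 3 visual tokens, 1 query token, 1 other token.
row = np.array([0.4, 0.2, 0.1, 0.2, 0.1])
new_row = amplify_attention(row, [0, 1, 2], [3], saliency)
```

In this toy example the sink token receives zero saliency, so after renormalization its share of attention shrinks while the shares of the shifting visual tokens and the query token grow.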
