Abstract
Multimodal Large Language Models (MLLMs) often suffer from hallucinations: they over-rely on partial cues and generate incorrect responses. Recently, methods such as Visual Contrastive Decoding (VCD) and Instruction Contrastive Decoding (ICD) have been proposed to mitigate hallucinations by contrasting predictions from perturbed or negatively prefixed inputs against the original outputs. In this work, we uncover that methods like VCD and ICD fundamentally influence the model's internal attention dynamics. This observation suggests that their effectiveness may stem not merely from surface-level modifications to logits but from deeper shifts in attention distribution. Inspired by this insight, we propose an attention-steerable contrastive decoding framework that directly intervenes in the model's attention mechanisms to offer a more principled approach to mitigating hallucinations. Our experiments across multiple MLLM architectures and diverse decoding methods demonstrate that our approach significantly reduces hallucinations and improves performance on benchmarks such as POPE, CHAIR, and MMHal-Bench, while simultaneously enhancing performance on standard VQA benchmarks.
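
For readers unfamiliar with logit-level contrastive decoding, the following is a minimal sketch of one greedy decoding step in the style of VCD and ICD. It is not the attention-steerable method proposed here. The function names, the contrast weight `alpha`, and the plausibility threshold `beta` are illustrative assumptions, and the two logit vectors stand in for the model's outputs on the original and the perturbed (or negatively prefixed) input.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def contrastive_decode_step(logits_orig, logits_contrast, alpha=1.0, beta=0.1):
    """Return the index of the next token chosen by contrastive scoring.

    logits_orig:     logits from the original input (assumed given).
    logits_contrast: logits from the perturbed / negatively prefixed input.
    alpha:           hypothetical contrast strength; alpha=0 is plain greedy.
    beta:            hypothetical plausibility threshold in (0, 1].
    """
    logp_orig = log_softmax(logits_orig)
    logp_con = log_softmax(logits_contrast)

    # Plausibility constraint: only consider tokens whose original
    # probability is at least beta times the most likely token's.
    cutoff = math.log(beta) + max(logp_orig)

    best_index, best_score = None, -math.inf
    for i, (lo, lc) in enumerate(zip(logp_orig, logp_con)):
        if lo < cutoff:
            continue
        # Amplify the original prediction, penalize what the
        # contrast input also predicts.
        score = (1 + alpha) * lo - alpha * lc
        if score > best_score:
            best_index, best_score = i, score
    return best_index
```

With `alpha=0` the step reduces to ordinary greedy decoding. The plausibility cutoff keeps the contrast from promoting tokens that the original distribution considers very unlikely.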