Mitigating Hallucination for Large Vision Language Model by Inter-Modality Correlation Calibration Decoding

Abstract

Large vision-language models (LVLMs) have shown remarkable capabilities in visual-language understanding for downstream multi-modal tasks. Despite their success, LVLMs still suffer from generating hallucinations in complex generation tasks, leading to inconsistencies between visual inputs and generated content. To address this issue, some approaches have introduced inference-time interventions, such as contrastive decoding and attention rectification, to reduce overreliance on language priors. However, these approaches overlook hallucinations stemming from spurious inter-modality correlations. In this paper, we propose an Inter-Modality Correlation Calibration Decoding (IMCCD) method to mitigate hallucinations in LVLMs in a training-free manner. In this method, we design a Cross-Modal Value-Enhanced Decoding (CMVED) module to alleviate hallucination through a novel contrastive decoding mechanism. During the estimation of the distorted distribution, CMVED masks the value vectors associated with significant cross-modal attention weights, which addresses both uni-modality overreliance and misleading inter-modality correlations. Additionally, a Content-Driven Attention Refinement (CDAR) module refines cross-modal attention weights, guiding LVLMs to focus on important visual content. Experimental results on diverse hallucination benchmarks validate the superiority of our method over existing state-of-the-art techniques in reducing hallucinations in LVLM text generation. Our code will be available at https://github.com/lijm48/IMCCD.
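
To make the two ideas in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes the commonly used contrastive decoding form, (1 + alpha) * original - alpha * distorted, which the abstract does not spell out, and it assumes that "masking" means zeroing the value vectors at the k positions with the largest attention weights. The function names, `alpha`, and `k` are hypothetical and chosen for illustration only.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mask_top_values(attn_weights, values, k):
    """Zero the value vectors at the k positions with the largest attention weight.

    Assumption: this stands in for "masking the value vectors associated with
    significant cross-modal attention weights"; the paper's selection rule may differ.
    """
    order = sorted(range(len(attn_weights)),
                   key=lambda i: attn_weights[i], reverse=True)
    top = set(order[:k])
    return [[0.0] * len(v) if i in top else list(v)
            for i, v in enumerate(values)]


def contrastive_logits(original, distorted, alpha=1.0):
    """Generic contrastive decoding: amplify what the distorted pass lacks.

    Assumption: the (1 + alpha) * original - alpha * distorted form is the usual
    formulation in prior contrastive decoding work, not confirmed by the abstract.
    """
    return [(1 + alpha) * o - alpha * d for o, d in zip(original, distorted)]


# Toy example: token 0 is supported by the visual evidence, so its logit drops
# in the distorted pass and is boosted after the contrast.
original = [2.0, 1.0]
distorted = [1.0, 1.0]
calibrated = contrastive_logits(original, distorted, alpha=1.0)
probs = softmax(calibrated)
```

In a real LVLM the distorted logits would come from a second forward pass in which the masked value vectors replace the originals inside the attention layers; here they are supplied directly to keep the sketch self-contained.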
