Cross-Image Contrastive Decoding: Precise, Lossless Suppression of Language Priors in Large Vision-Language Models

Abstract

Language priors are a major cause of hallucinations in Large Vision-Language Models (LVLMs), often leading to text that is linguistically plausible but visually inconsistent. Recent work explores contrastive decoding as a training-free solution, but these methods typically construct negative contexts from the original image, resulting in visual information loss and a distorted output distribution. Motivated by the observation that language priors stem from the LLM backbone and remain consistent across images, we propose Cross-Image Contrastive Decoding (CICD), a simple yet effective training-free method that uses different images to construct negative contexts. We further analyze the cross-image behavior of language priors and introduce a distinction between essential priors (supporting fluency) and detrimental priors (causing hallucinations). By selectively preserving essential priors and suppressing detrimental ones, our method reduces hallucinations while maintaining coherent and fluent language generation. Experiments on four benchmarks and six LVLMs across three model families confirm the effectiveness and generalizability of CICD, especially in image captioning, where language priors are particularly pronounced. Code will be released upon acceptance.
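
To make the core idea concrete, the following is a minimal sketch of generic contrastive decoding in which the negative context comes from a different image. It is not the authors' implementation: the combination rule (1 + alpha) * log p_target - alpha * log p_negative is the form commonly used in the contrastive decoding literature, and the function names, the weight alpha, the toy vocabulary, and all logit values are assumptions made for illustration. The selective handling of essential versus detrimental priors described in the abstract is not modeled here.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def contrastive_scores(target_logits, negative_logits, alpha=1.0):
    """Generic contrastive decoding scores (assumed form, not taken from the paper).

    target_logits:   next-token logits conditioned on the query image.
    negative_logits: next-token logits for the same text prefix, but
                     conditioned on a different image.
    alpha:           hypothetical contrast strength; alpha = 0 recovers
                     ordinary decoding on the target distribution.
    """
    lp_t = log_softmax(target_logits)
    lp_n = log_softmax(negative_logits)
    return [(1 + alpha) * t - alpha * n for t, n in zip(lp_t, lp_n)]


def greedy_pick(scores):
    """Index of the highest-scoring token."""
    return max(range(len(scores)), key=scores.__getitem__)


# Toy example with made-up numbers. "dog" is favored under both images,
# so it behaves like a language prior; "frisbee" is supported only by
# the query image.
vocab = ["dog", "cat", "frisbee", "table"]
target = [2.0, 0.5, 1.9, 0.1]
negative = [2.0, 0.4, 0.2, 0.1]

print(vocab[greedy_pick(target)])                                # dog
print(vocab[greedy_pick(contrastive_scores(target, negative))])  # frisbee
```

In this toy case, the token that scores highly regardless of the image is down-weighted, while the token supported only by the query image is promoted. Because the negative context is built from a different image rather than a degraded copy of the original, the target branch still sees the query image unmodified.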
