Mitigate Language Priors in Large Vision-Language Models by Cross-Images Contrastive Decoding

Abstract

Language priors are a major cause of hallucinations in Large Vision-Language Models (LVLMs), often leading to text that is linguistically plausible but visually inconsistent. Recent work explores contrastive decoding as a training-free solution, but these methods typically construct negative visual contexts from the original image, resulting in visual information loss and a distorted distribution. Motivated by the observation that language priors stem from the LLM backbone and remain consistent across images, we propose Cross-Images Contrastive Decoding (CICD), a simple yet effective training-free method that uses different images to construct negative visual contexts. We further analyze the cross-image behavior of language priors and introduce a distinction between essential priors (supporting fluency) and detrimental priors (causing hallucinations), enabling selective suppression. By selectively preserving essential priors and suppressing detrimental ones, our method reduces hallucinations while maintaining coherent and fluent language generation. Experiments on four benchmarks and six LVLMs across three model families confirm the effectiveness and generalizability of CICD, especially in image captioning, where language priors are particularly pronounced. Code will be released upon acceptance.
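The abstract does not state the decoding rule, so the following is only a minimal sketch of the general idea, assuming the common contrastive-decoding form `(1 + alpha) * log p(target image) - alpha * log p(other image)`. The function names, the `alpha` parameter, and the toy logits are hypothetical illustrations and are not taken from the paper; the selective handling of essential versus detrimental priors is not modeled here.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def cross_image_contrast(logits_target, logits_other, alpha=1.0):
    """Contrast next-token scores from the target image against scores
    obtained with the same text prefix but a different image.
    Assumed form (not from the paper): (1 + alpha) * lp_t - alpha * lp_o."""
    lp_t = log_softmax(logits_target)
    lp_o = log_softmax(logits_other)
    return [(1 + alpha) * t - alpha * o for t, o in zip(lp_t, lp_o)]

def argmax_token(vocab, scores):
    return vocab[max(range(len(vocab)), key=scores.__getitem__)]

# Toy example: "frisbee" is favored regardless of which image is shown
# (a language prior), while "dog" is supported only by the target image.
vocab = ["dog", "frisbee", "grass"]
logits_target = [1.8, 2.0, 0.5]  # conditioned on the target image
logits_other = [0.1, 1.9, 0.4]   # same prefix, a different image

plain_choice = argmax_token(vocab, logits_target)
contrast_choice = argmax_token(
    vocab, cross_image_contrast(logits_target, logits_other)
)
print(plain_choice, contrast_choice)
```

In this toy setting, a token that scores highly under both images is penalized, while a token supported only by the target image is promoted. Setting `alpha=0` reduces the scores to ordinary log-probabilities under the target image.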
