CATCH: Complementary Adaptive Token-level Contrastive Decoding to Mitigate Hallucinations in LVLMs

Abstract

Large Vision-Language Model (LVLM) systems have demonstrated impressive vision-language reasoning capabilities but suffer from pervasive and severe hallucination issues, posing significant risks in critical domains such as healthcare and autonomous systems. Despite previous efforts to mitigate hallucinations, a persistent issue remains: visual defects arising from vision-language misalignment, which create a bottleneck in visual processing capacity. To address this challenge, we develop Complementary Adaptive Token-level Contrastive Decoding to Mitigate Hallucinations in LVLMs (CATCH), based on Information Bottleneck theory. CATCH introduces Complementary Visual Decoupling (CVD) for visual information separation, Non-Visual Screening (NVS) for hallucination detection, and Adaptive Token-level Contrastive Decoding (ATCD) for hallucination mitigation. CATCH addresses visual defects that cause diminished fine-grained feature perception and cumulative hallucinations in open-ended scenarios. It is applicable to various visual question-answering tasks without requiring any specific data or prior knowledge, and it generalizes robustly to new tasks without additional training, opening new possibilities for advancing LVLMs in various challenging applications.
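
For orientation, the following is a minimal sketch of generic token-level contrastive decoding, the family of techniques that ATCD belongs to. It is not CATCH's formulation, which the abstract does not specify. The function name `contrastive_step`, the weights `alpha` and `beta`, the plausibility cutoff, and the idea that the second set of logits comes from a degraded or non-visual input are all assumptions made for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def contrastive_step(logits_full, logits_contrast, alpha=1.0, beta=0.1):
    """Pick the next token id for one decoding step (illustrative only).

    logits_full:     next-token logits given the full visual input.
    logits_contrast: next-token logits given a degraded or non-visual
                     input (hypothetical; stands in for a contrast branch).
    alpha:           contrast strength (assumed hyperparameter).
    beta:            plausibility cutoff relative to the most likely
                     token under the full input (assumed hyperparameter).
    """
    p_full = softmax(logits_full)
    cutoff = beta * max(p_full)

    scores = []
    for lf, lc, pf in zip(logits_full, logits_contrast, p_full):
        if pf < cutoff:
            # Implausible under the full input: never select it.
            scores.append(float("-inf"))
        else:
            # Amplify what the full input supports and subtract what
            # the contrast branch supports on its own.
            scores.append((1 + alpha) * lf - alpha * lc)

    # Greedy choice over the contrasted scores.
    return max(range(len(scores)), key=scores.__getitem__)


# Token 0 is favored mostly by the contrast branch (a language prior),
# token 1 mostly by the visual input, token 2 is implausible.
full = [2.0, 1.8, -3.0]
contrast = [2.5, 0.5, 4.0]
next_token = contrastive_step(full, contrast)
```

With `alpha=0.0` the function reduces to plain greedy decoding on `logits_full`. An adaptive variant would vary `alpha` per token, but how CATCH does so is not stated here.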
