Abstract
Large Vision-Language Model (LVLM) systems have demonstrated impressive vision-language reasoning capabilities but suffer from pervasive and severe hallucination issues, posing significant risks in critical domains such as healthcare and autonomous systems. Despite previous efforts to mitigate hallucinations, a persistent issue remains: a visual defect arising from vision-language misalignment, which creates a bottleneck in visual processing capacity. To address this challenge, we develop Complementary Adaptive Token-level Contrastive Decoding to Mitigate Hallucinations in LVLMs (CATCH), based on the Information Bottleneck theory. CATCH introduces Complementary Visual Decoupling (CVD) for visual information separation, Non-Visual Screening (NVS) for hallucination detection, and Adaptive Token-level Contrastive Decoding (ATCD) for hallucination mitigation. CATCH addresses the visual defects that cause diminished fine-grained feature perception and cumulative hallucinations in open-ended scenarios. It is applicable to various visual question-answering tasks without requiring any specific data or prior knowledge, and it generalizes robustly to new tasks without additional training, opening new possibilities for advancing LVLMs in various challenging applications.