Paying More Attention to Image: A Training-Free Method for Alleviating Hallucination in LVLMs

Abstract

Existing Large Vision-Language Models (LVLMs) primarily align the image features of a vision encoder with Large Language Models (LLMs) to leverage their superior text generation capabilities. However, the scale disparity between the vision encoder and the language model may lead to the LLM assuming a predominant role in multi-modal comprehension. This imbalance may result in hallucinations. Concretely, LVLMs may generate consistent descriptions with or without visual input, indicating that certain outputs are influenced solely by the context text. We refer to this phenomenon as "text inertia." To counteract this issue, we introduce a training-free algorithm that finds an equilibrium point between image comprehension and language inference. Specifically, we adaptively adjust and amplify the attention weights assigned to image tokens, thereby granting greater prominence to visual elements. Meanwhile, we subtract the logits of a pure-text input from those of the multi-modal input, which helps keep LVLMs from being biased towards the LLM. By enhancing image tokens and reducing the stubborn output of the LLM, we let the LVLM pay more attention to images, alleviating text inertia and reducing hallucination. Our extensive experiments show that this method substantially reduces the frequency of hallucinatory outputs in various LVLMs across different metrics. The project page is available at https://lalbj.github.io/projects/PAI/.
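
The two ingredients named in the abstract can be illustrated with a small, self-contained sketch. This is not the authors' implementation: the function names, the boost rule (adding `alpha * |score|` to pre-softmax scores at image positions), the mixing rule (`gamma * multimodal - (gamma - 1) * text_only`), and the default values of `alpha` and `gamma` are all assumptions chosen for illustration, and the sketch operates on plain Python lists for a single query rather than on real model tensors.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def amplify_image_attention(scores, image_positions, alpha=0.5):
    """Boost pre-softmax attention scores at image-token positions.

    Assumed rule (illustrative only): score + alpha * |score|.
    Because the added term is never negative, the share of attention
    going to image tokens cannot decrease after the softmax.
    """
    boosted = list(scores)
    for i in image_positions:
        boosted[i] = scores[i] + alpha * abs(scores[i])
    return softmax(boosted)


def contrast_logits(multimodal_logits, text_only_logits, gamma=1.1):
    """Subtract a scaled copy of the text-only logits from the
    multi-modal logits, damping tokens the language model would
    produce without looking at the image.

    Assumed rule (illustrative only):
        gamma * multimodal - (gamma - 1) * text_only
    With gamma == 1 this returns the multi-modal logits unchanged.
    """
    return [
        gamma * m - (gamma - 1.0) * t
        for m, t in zip(multimodal_logits, text_only_logits)
    ]


if __name__ == "__main__":
    # Hypothetical scores for one query over four tokens;
    # positions 1 and 2 are treated as image tokens.
    scores = [1.0, 2.0, 0.5, -1.0]
    print(softmax(scores))
    print(amplify_image_attention(scores, image_positions=[1, 2]))

    # Hypothetical next-token logits over a three-word vocabulary.
    print(contrast_logits([2.0, 1.0, 0.2], [2.0, 0.0, 0.1], gamma=1.5))
```

In this toy setting, a token whose logit is equally high with and without the image gains nothing from the contrast step, while a token supported mainly by the image is pushed up. That is the intuition behind reducing text inertia.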
