Looking Beyond Text: Reducing Language Bias in Large Vision-Language Models via Multimodal Dual-Attention and Soft-Image Guidance

Abstract

Large vision-language models (LVLMs) have achieved impressive results in various vision-language tasks. However, despite showing promising performance, LVLMs suffer from hallucinations caused by language bias, leading to diminished focus on images and ineffective visual comprehension. We identify two primary reasons for this bias: (1) the different scales of training data between the LLM pretraining stage and the multimodal alignment stage, and (2) the inference bias learned from the short-term dependency of text data. Therefore, we propose LACING, a systemic framework designed to address the language bias of LVLMs with muLtimodal duAl-attention meChanIsm (MDA) aNd soft-image Guidance (IFG). Specifically, MDA introduces a parallel dual-attention mechanism that enhances the integration of visual inputs across the model. IFG introduces a learnable soft visual prompt during training and inference to replace visual inputs, designed to compel LVLMs to prioritize text inputs. Then, IFG further proposes a novel decoding strategy using the soft visual prompt to mitigate the model's over-reliance on adjacent text inputs. Comprehensive experiments demonstrate that our method effectively debiases LVLMs from their language bias, enhancing visual comprehension and reducing hallucinations without requiring additional training resources or data. The code and model are available at [lacing-lvlm.github.io](https://lacing-lvlm.github.io).
