Abstract
Large Vision-Language Models (LVLMs) have demonstrated remarkable capabilities in visual understanding and multimodal reasoning. However, LVLMs frequently hallucinate, generating textual responses that are inconsistent with the provided visual content. Existing hallucination mitigation methods are predominantly text-centric, and the challenges of visual-semantic alignment significantly limit their effectiveness, especially in fine-grained visual understanding scenarios. To this end, this paper presents ViHallu, a Vision-Centric Hallucination mitigation framework that enhances visual-semantic alignment through Visual Variation Image Generation and Visual Instruction Construction. ViHallu introduces visual variation images with controllable visual alterations while maintaining the overall image structure. These images, combined with carefully constructed visual instructions, enable LVLMs to better understand fine-grained visual content through fine-tuning, allowing models to more precisely capture the correspondence between visual content and text, thereby enhancing visual-semantic alignment. Extensive experiments on multiple benchmarks show that ViHallu effectively enhances models' fine-grained visual understanding while significantly reducing hallucination tendencies. Furthermore, we release ViHallu-Instruction, a visual instruction dataset specifically designed for hallucination mitigation and visual-semantic alignment. Code is available at https://github.com/oliviadzy/ViHallu.