Re-Align: Aligning Vision Language Models via Retrieval-Augmented Direct Preference Optimization

Abstract

The emergence of large Vision Language Models (VLMs) has broadened the scope and capabilities of single-modal Large Language Models (LLMs) by integrating visual modalities, thereby unlocking transformative cross-modal applications in a variety of real-world scenarios. Despite their impressive performance, VLMs are prone to significant hallucinations, particularly in the form of cross-modal inconsistencies. Building on the success of Reinforcement Learning from Human Feedback (RLHF) in aligning LLMs, recent advancements have focused on applying direct preference optimization (DPO) on carefully curated datasets to mitigate these issues. Yet, such approaches typically introduce preference signals in a brute-force manner, neglecting the crucial role of visual information in the alignment process. In this paper, we introduce Re-Align, a novel alignment framework that leverages image retrieval to construct a dual-preference dataset, effectively incorporating both textual and visual preference signals. We further introduce rDPO, an extension of the standard direct preference optimization that incorporates an additional visual preference objective during fine-tuning. Our experimental results demonstrate that Re-Align not only mitigates hallucinations more effectively than previous methods but also yields significant performance gains in general visual question-answering (VQA) tasks. Moreover, we show that Re-Align maintains robustness and scalability across a wide range of VLM sizes and architectures. This work represents a significant step forward in aligning multimodal LLMs, paving the way for more reliable and effective cross-modal applications. We release all the code at https://github.com/taco-group/Re-Align.
