Understanding Alignment in Multimodal LLMs: A Comprehensive Study

Abstract

Preference alignment has become a crucial component in enhancing the performance of Large Language Models (LLMs), yet its impact in Multimodal Large Language Models (MLLMs) remains comparatively underexplored. Similar to language models, MLLMs for image understanding tasks encounter challenges like hallucination. In MLLMs, hallucination can occur not only by stating incorrect facts but also by producing responses that are inconsistent with the image content. A primary objective of alignment for MLLMs is to encourage these models to align responses more closely with image information. Recently, multiple works have introduced preference datasets for MLLMs and examined different alignment methods, including Direct Preference Optimization (DPO) and Proximal Policy Optimization (PPO). However, due to variations in datasets, base model types, and alignment methods, it remains unclear which specific elements contribute most significantly to the reported improvements in these works. In this paper, we independently analyze each aspect of preference alignment in MLLMs. We start by categorizing the alignment algorithms into two groups, offline (such as DPO) and online (such as online-DPO), and show that combining offline and online methods can improve the performance of the model in certain scenarios. We review a variety of published multimodal preference datasets and discuss how the details of their construction impact model performance. Based on these insights, we introduce a novel way of creating multimodal preference data called Bias-Driven Hallucination Sampling (BDHS) that needs neither additional annotation nor external models, and show that it can achieve performance competitive with previously published alignment work for multimodal models across a range of benchmarks.
