The Curse of Multi-Modalities: Evaluating Hallucinations of Large Multimodal Models across Language, Visual, and Audio

Abstract

Recent advancements in large multimodal models (LMMs) have significantly enhanced performance across diverse tasks, with ongoing efforts to further integrate additional modalities such as video and audio. However, most existing LMMs remain vulnerable to hallucinations, the discrepancy between the factual multimodal input and the generated textual output, which has limited their applicability in various real-world scenarios. This paper presents the first systematic investigation of hallucinations in LMMs involving the three most common modalities: language, visual, and audio. Our study reveals two key contributors to hallucinations: overreliance on unimodal priors and spurious inter-modality correlations. To address these challenges, we introduce the benchmark The Curse of Multi-Modalities (CMM), which comprehensively evaluates hallucinations in LMMs, providing a detailed analysis of their underlying issues. Our findings highlight key vulnerabilities, including imbalances in modality integration and biases from training data, underscoring the need for balanced cross-modal learning and enhanced hallucination mitigation strategies. Based on our observations and findings, we suggest potential research directions that could enhance the reliability of LMMs.
