Abstract
Multimodal language analysis is a rapidly evolving field that leverages multiple modalities to enhance the understanding of high-level semantics underlying human conversational utterances. Despite its significance, little research has investigated the capability of multimodal large language models (MLLMs) to comprehend cognitive-level semantics. In this paper, we introduce MMLA, a comprehensive benchmark specifically designed to address this gap. MMLA comprises over 61K multimodal utterances drawn from both staged and real-world scenarios, covering six core dimensions of multimodal semantics: intent, emotion, dialogue act, sentiment, speaking style, and communication behavior. We evaluate eight mainstream branches of LLMs and MLLMs using three methods: zero-shot inference, supervised fine-tuning, and instruction tuning. Extensive experiments reveal that even fine-tuned models achieve only about 60%~70% accuracy, underscoring the limitations of current MLLMs in understanding complex human language. We believe that MMLA will serve as a solid foundation for exploring the potential of large language models in multimodal language analysis and provide valuable resources to advance this field. The datasets and code are open-sourced at https://github.com/thuiar/MMLA.