On the Out-Of-Distribution Generalization of Multimodal Large Language Models

Abstract

We investigate the generalization boundaries of current Multimodal Large Language Models (MLLMs) via comprehensive evaluation under out-of-distribution scenarios and domain-specific tasks. We evaluate their zero-shot generalization across synthetic images, real-world distributional shifts, and specialized datasets like medical and molecular imagery. Empirical results indicate that MLLMs struggle with generalization beyond common training domains, limiting their direct application without adaptation. To understand the cause of unreliable performance, we analyze three hypotheses: semantic misinterpretation, visual feature extraction insufficiency, and mapping deficiency. Results identify mapping deficiency as the primary hurdle. To address this problem, we show that in-context learning (ICL) can significantly enhance MLLMs' generalization, opening new avenues for overcoming generalization barriers. We further explore the robustness of ICL under distribution shifts and show its vulnerability to domain shifts, label shifts, and spurious correlation shifts between in-context examples and test data.
