Abstract
Multi-modal Large Language Models (MLLMs) have shown impressive abilities in generating reasonable responses with respect to multi-modal contents. However, there is still a wide gap between the performance of recent MLLM-based applications and the expectations of the broad public, even though the most powerful models, OpenAI's GPT-4 and Google's Gemini, have been deployed. This paper strives to enhance understanding of the gap through the lens of a qualitative study on the generalizability, trustworthiness, and causal reasoning capabilities of recent proprietary and open-source MLLMs across four modalities, i.e., text, code, image, and video, ultimately aiming to improve the transparency of MLLMs. We believe these properties are representative factors that define the reliability of MLLMs in supporting various downstream applications. To be specific, we evaluate the closed-source GPT-4 and Gemini and 6 open-source LLMs and MLLMs. Overall, we evaluate 230 manually designed cases, where the qualitative results are then summarized into 12 scores (i.e., 4 modalities times 3 properties). In total, we uncover 14 empirical findings that are useful for understanding the capabilities and limitations of both proprietary and open-source MLLMs, towards more reliable downstream multi-modal applications.