From GPT-4 to Gemini and Beyond: Assessing the Landscape of MLLMs on Generalizability, Trustworthiness and Causality through Four Modalities

  • 2024-01-29 15:18:45
  • Chaochao Lu, Chen Qian, Guodong Zheng, Hongxing Fan, Hongzhi Gao, Jie Zhang, Jing Shao, Jingyi Deng, Jinlan Fu, Kexin Huang, Kunchang Li, Lijun Li, Limin Wang, Lu Sheng, Meiqi Chen, Ming Zhang, Qibing Ren, Sirui Chen, Tao Gui, Wanli Ouyang, Yali Wang, Yan Teng, Yaru Wang, Yi Wang, Yinan He, Yingchun Wang, Yixu Wang, Yongting Zhang, Yu Qiao, Yujiong Shen, Yurong Mou, Yuxi Chen, Zaibin Zhang, Zhelun Shi, Zhenfei Yin, Zhipin Wang

Abstract

Multi-modal Large Language Models (MLLMs) have shown impressive abilities in generating reasonable responses with respect to multi-modal contents. However, there is still a wide gap between the performance of recent MLLM-based applications and the expectations of the broad public, even though the most powerful models, OpenAI's GPT-4 and Google's Gemini, have been deployed. This paper strives to enhance understanding of the gap through the lens of a qualitative study on the generalizability, trustworthiness, and causal reasoning capabilities of recent proprietary and open-source MLLMs across four modalities, i.e., text, code, image, and video, ultimately aiming to improve the transparency of MLLMs. We believe these properties are several representative factors that define the reliability of MLLMs in supporting various downstream applications. To be specific, we evaluate the closed-source GPT-4 and Gemini and 6 open-source LLMs and MLLMs. Overall, we evaluate 230 manually designed cases, where the qualitative results are then summarized into 12 scores (i.e., 4 modalities times 3 properties). In total, we uncover 14 empirical findings that are useful for understanding the capabilities and limitations of both proprietary and open-source MLLMs, towards more reliable downstream multi-modal applications.

 
