MM-LLMs: Recent Advances in MultiModal Large Language Models

Abstract

In the past year, MultiModal Large Language Models (MM-LLMs) have undergone substantial advancements, augmenting off-the-shelf LLMs to support MM inputs or outputs via cost-effective training strategies. The resulting models not only preserve the inherent reasoning and decision-making capabilities of LLMs but also empower a diverse range of MM tasks. In this paper, we provide a comprehensive survey aimed at facilitating further research of MM-LLMs. Specifically, we first outline general design formulations for model architecture and training pipeline. Subsequently, we provide brief introductions of $26$ existing MM-LLMs, each characterized by its specific formulations. Additionally, we review the performance of MM-LLMs on mainstream benchmarks and summarize key training recipes to enhance the potency of MM-LLMs. Lastly, we explore promising directions for MM-LLMs while concurrently maintaining a real-time tracking website for the latest developments in the field. We hope that this survey contributes to the ongoing advancement of the MM-LLMs domain.
