X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages

Abstract

Large language models (LLMs) have demonstrated remarkable language abilities. GPT-4, based on advanced LLMs, exhibits extraordinary multimodal capabilities beyond previous visual language models. We attribute this to the use of more advanced LLMs compared with previous multimodal models. Unfortunately, the model architecture and training strategies of GPT-4 are unknown. To endow LLMs with multimodal capabilities, we propose X-LLM, which converts multi-modalities (images, speech, videos) into foreign languages using X2L interfaces and inputs them into a large language model (ChatGLM). Specifically, X-LLM aligns multiple frozen single-modal encoders and a frozen LLM using X2L interfaces, where "X" denotes multi-modalities such as image, speech, and videos, and "L" denotes languages. X-LLM's training consists of three stages: (1) Converting multimodal information: the first stage trains each X2L interface to align with its respective single-modal encoder separately to convert multimodal information into languages. (2) Aligning X2L representations with the LLM: single-modal encoders are aligned with the LLM through X2L interfaces independently. (3) Integrating multiple modalities: all single-modal encoders are aligned with the LLM through X2L interfaces to integrate multimodal capabilities into the LLM. Our experiments show that X-LLM demonstrates impressive multimodal chat abilities, sometimes exhibiting the behaviors of multimodal GPT-4 on unseen images/instructions, and yields an 84.5% relative score compared with GPT-4 on a synthetic multimodal instruction-following dataset. We also conduct quantitative tests on using an LLM for ASR and multimodal ASR, hoping to promote the era of LLM-based speech recognition.
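
The three-stage schedule described above can be sketched as a simple training plan. This is only an illustrative reading of the abstract, not the authors' code: the module names (`image_encoder`, `image_x2l`, `llm`) and the function `training_plan` are hypothetical, and the sketch assumes that the LLM is not involved in stage 1, since the abstract mentions only the encoder and the interface there. In every run, only the X2L interfaces are marked trainable, while the encoders and the LLM stay frozen.

```python
# Hypothetical sketch of the three-stage schedule; names are illustrative only.
MODALITIES = ("image", "speech", "video")


def _run(stage, group, uses_llm):
    """Describe one training run: which modules are frozen, which are trained."""
    frozen = [f"{m}_encoder" for m in group]  # single-modal encoders stay frozen
    if uses_llm:
        frozen.append("llm")  # the LLM is frozen whenever it takes part
    return {
        "stage": stage,
        "modalities": group,
        "trainable": [f"{m}_x2l" for m in group],  # only X2L interfaces update
        "frozen": frozen,
    }


def training_plan(modalities=MODALITIES):
    """Return one dict per training run, in stage order."""
    plan = []
    # Stages 1 and 2 handle each modality separately.
    # Assumption: stage 1 does not involve the LLM, stage 2 does.
    for stage, uses_llm in ((1, False), (2, True)):
        for m in modalities:
            plan.append(_run(stage, (m,), uses_llm))
    # Stage 3 trains all interfaces together against the same frozen LLM.
    plan.append(_run(3, tuple(modalities), True))
    return plan


if __name__ == "__main__":
    for run in training_plan():
        print(run["stage"], run["modalities"], "train:", run["trainable"])
```

Listing the runs this way makes the difference between the stages explicit: stages 1 and 2 each produce one run per modality, while stage 3 produces a single joint run over all modalities.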
