IAA: Inner-Adaptor Architecture Empowers Frozen Large Language Model with Multimodal Capabilities

Abstract

In the field of multimodal large language models (MLLMs), common methods typically involve unfreezing the language model during training to foster profound visual understanding. However, the fine-tuning of such models with vision-language data often leads to a diminution of their natural language processing (NLP) capabilities. To avoid this performance degradation, a straightforward solution is to freeze the language model while developing multimodal competencies. Unfortunately, previous works have not attained satisfactory outcomes. Building on the strategy of freezing the language model, we conduct thorough structural exploration and introduce the Inner-Adaptor Architecture (IAA). Specifically, the architecture incorporates multiple multimodal adaptors at varying depths within the large language model to facilitate direct interaction with the inherently text-oriented transformer layers, thereby enabling the frozen language model to acquire multimodal capabilities. Unlike previous approaches of freezing language models that require large-scale aligned data, our proposed architecture is able to achieve superior performance on small-scale datasets. We conduct extensive experiments to improve the general multimodal capabilities and visual grounding abilities of the MLLM. Our approach remarkably outperforms previous state-of-the-art methods across various vision-language benchmarks without sacrificing performance on NLP tasks. Code and models are available at https://github.com/360CVGroup/Inner-Adaptor-Architecture.
