Abstract
In this paper, we focus on monolithic Multimodal Large Language Models (MLLMs) that integrate visual encoding and language decoding into a single LLM. In particular, we identify that existing pre-training strategies for monolithic MLLMs often suffer from unstable optimization or catastrophic forgetting. To address this issue, our core idea is to embed a new visual parameter space into a pre-trained LLM, thereby stably learning visual knowledge from noisy data while freezing the LLM. Based on this principle, we present Mono-InternVL, a novel monolithic MLLM that seamlessly integrates a set of visual experts via a multimodal mixture-of-experts structure. Moreover, we propose an innovative pre-training strategy to maximize the visual capability of Mono-InternVL, namely Endogenous Visual Pre-training (EViP). In particular, EViP is designed as a progressive learning process for visual experts, which aims to fully exploit visual knowledge by progressing from noisy data to high-quality data. To validate our approach, we conduct extensive experiments on 16 benchmarks. Experimental results confirm that Mono-InternVL outperforms existing monolithic MLLMs on 13 of 16 multimodal benchmarks, e.g., +80 points over Emu3 on OCRBench. Compared to the modular baseline, i.e., InternVL-1.5, Mono-InternVL retains comparable multimodal performance while reducing first-token latency by up to 67%. Code and model are released at https://huggingface.co/OpenGVLab/Mono-InternVL-2B.