InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

  • 2025-04-14 18:59:25
  • Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Yuchen Duan, Hao Tian, Weijie Su, Jie Shao, Zhangwei Gao, Erfei Cui, Yue Cao, Yangzhou Liu, Weiye Xu, Hao Li, Jiahao Wang, Han Lv, Dengnian Chen, Songze Li, Yinan He, Tan Jiang, Jiapeng Luo, Yi Wang, Conghui He, Botian Shi, Xingcheng Zhang, Wenqi Shao, Junjun He, Yingtong Xiong, Wenwen Qu, Peng Sun, Penglong Jiao, Lijun Wu, Kaipeng Zhang, Huipeng Deng, Jiaye Ge, Kai Chen, Limin Wang, Min Dou, Lewei Lu, Xizhou Zhu, Tong Lu, Dahua Lin, Yu Qiao, Jifeng Dai, Wenhai Wang

Abstract

We introduce InternVL3, a significant advancement in the InternVL series featuring a native multimodal pre-training paradigm. Rather than adapting a text-only large language model (LLM) into a multimodal large language model (MLLM) that supports visual inputs, InternVL3 jointly acquires multimodal and linguistic capabilities from both diverse multimodal data and pure-text corpora during a single pre-training stage. This unified training paradigm effectively addresses the complexities and alignment challenges commonly encountered in conventional post-hoc training pipelines for MLLMs. To further improve performance and scalability, InternVL3 incorporates variable visual position encoding (V2PE) to support extended multimodal contexts, employs advanced post-training techniques such as supervised fine-tuning (SFT) and mixed preference optimization (MPO), and adopts test-time scaling strategies alongside an optimized training infrastructure. Extensive empirical evaluations demonstrate that InternVL3 delivers superior performance across a wide range of multi-modal tasks. In particular, InternVL3-78B achieves a score of 72.2 on the MMMU benchmark, setting a new state-of-the-art among open-source MLLMs. Its capabilities remain highly competitive with leading proprietary models, including ChatGPT-4o, Claude 3.5 Sonnet, and Gemini 2.5 Pro, while also maintaining strong pure-language proficiency. In pursuit of open-science principles, we will publicly release both the training data and model weights to foster further research and development in next-generation MLLMs.

 
