InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

Abstract

We introduce InternVL3, a significant advancement in the InternVL series featuring a native multimodal pre-training paradigm. Rather than adapting a text-only large language model (LLM) into a multimodal large language model (MLLM) that supports visual inputs, InternVL3 jointly acquires multimodal and linguistic capabilities from both diverse multimodal data and pure-text corpora during a single pre-training stage. This unified training paradigm effectively addresses the complexities and alignment challenges commonly encountered in conventional post-hoc training pipelines for MLLMs. To further improve performance and scalability, InternVL3 incorporates variable visual position encoding (V2PE) to support extended multimodal contexts, employs advanced post-training techniques such as supervised fine-tuning (SFT) and mixed preference optimization (MPO), and adopts test-time scaling strategies alongside an optimized training infrastructure. Extensive empirical evaluations demonstrate that InternVL3 delivers superior performance across a wide range of multi-modal tasks. In particular, InternVL3-78B achieves a score of 72.2 on the MMMU benchmark, setting a new state-of-the-art among open-source MLLMs. Its capabilities remain highly competitive with leading proprietary models, including ChatGPT-4o, Claude 3.5 Sonnet, and Gemini 2.5 Pro, while also maintaining strong pure-language proficiency. In pursuit of open-science principles, we will publicly release both the training data and model weights to foster further research and development in next-generation MLLMs.
