NVLM: Open Frontier-Class Multimodal LLMs

Abstract

We introduce NVLM 1.0, a family of frontier-class multimodal large language models (LLMs) that achieve state-of-the-art results on vision-language tasks, rivaling the leading proprietary models (e.g., GPT-4o) and open-access models (e.g., Llama 3-V 405B and InternVL 2). Remarkably, NVLM 1.0 shows improved text-only performance over its LLM backbone after multimodal training. In terms of model design, we perform a comprehensive comparison between decoder-only multimodal LLMs (e.g., LLaVA) and cross-attention-based models (e.g., Flamingo). Based on the strengths and weaknesses of both approaches, we propose a novel architecture that enhances both training efficiency and multimodal reasoning capabilities. Furthermore, we introduce a 1-D tile-tagging design for tile-based dynamic high-resolution images, which significantly boosts performance on multimodal reasoning and OCR-related tasks. Regarding training data, we meticulously curate and provide detailed information on our multimodal pretraining and supervised fine-tuning datasets. Our findings indicate that dataset quality and task diversity are more important than scale, even during the pretraining phase, across all architectures. Notably, we develop production-grade multimodality for the NVLM 1.0 models, enabling them to excel in vision-language tasks while maintaining and even improving text-only performance compared to their LLM backbones. To achieve this, we craft and integrate a high-quality text-only dataset into multimodal training, alongside a substantial amount of multimodal math and reasoning data, leading to enhanced math and coding capabilities across modalities. To advance research in the field, we are releasing the model weights and will open-source the code for the community: https://nvlm-project.github.io/.
