Abstract
In recent years, vision-language model pre-training has advanced rapidly, driven primarily by the continuous improvement of the textual capabilities of large language models. However, existing training paradigms for multimodal large language models rely heavily on high-quality image-text pairs. As model and data scales grow exponentially, such meticulously curated data has become increasingly scarce and saturated, severely limiting further progress in this domain. This study investigates scalable caption generation techniques for vision-language model pre-training and demonstrates empirically that large-scale, low-hallucination synthetic captions serve two purposes: 1) they are a viable alternative to real-world data in pre-training, and 2) they yield superior performance when integrated into vision-language models. This paper presents three key contributions: 1) We propose a novel pipeline for generating high-quality, low-hallucination, and knowledge-rich synthetic captions. Our continuous DPO methodology substantially reduces hallucinations: the non-hallucination caption rate on a held-out test set increases from 48.2% to 77.9% for a 7B-size model. 2) Comprehensive empirical validation shows that our synthetic captions confer greater pre-training advantages than their counterparts. Across 35 vision-language tasks, the model trained with our data achieves a performance gain of at least 6.2% compared to alt-text pairs and other previous work. Our data also benefits the text-to-image domain: with our dataset, the FID score is reduced by 17.1 on a real-world validation benchmark and by 13.3 on the MSCOCO validation benchmark. 3) We will release Hunyuan-Recap100M, a low-hallucination and knowledge-intensive synthetic caption dataset.