Bee: A High-Quality Corpus and Full-Stack Suite to Unlock Advanced Fully Open MLLMs

Abstract

Fully open multimodal large language models (MLLMs) currently lag behind proprietary counterparts, primarily due to a significant gap in data quality for supervised fine-tuning (SFT). Existing open-source datasets are often plagued by widespread noise and a critical deficit in complex reasoning data, such as Chain-of-Thought (CoT), which hinders the development of advanced model capabilities. Addressing these challenges, our work makes three primary contributions. First, we introduce Honey-Data-15M, a new SFT dataset comprising approximately 15 million QA pairs, processed through multiple cleaning techniques and enhanced with a novel dual-level (short and long) CoT enrichment strategy. Second, we introduce HoneyPipe, the data curation pipeline, and its underlying framework DataStudio, providing the community with a transparent and adaptable methodology for data curation that moves beyond static dataset releases. Finally, to validate our dataset and pipeline, we train Bee-8B, an 8B model, on Honey-Data-15M. Experiments show that Bee-8B establishes a new state-of-the-art (SOTA) for fully open MLLMs, achieving performance that is competitive with, and in some cases surpasses, recent semi-open models such as InternVL3.5-8B. Our work delivers to the community a suite of foundational resources, including: the Honey-Data-15M corpus; the full-stack suite comprising HoneyPipe and DataStudio; training recipes; an evaluation harness; and the model weights. This effort demonstrates that a principled focus on data quality is a key pathway to developing fully open MLLMs that are highly competitive with their semi-open counterparts.
