Abstract
Today's most advanced multimodal models remain proprietary. The strongest open-weight models rely heavily on synthetic data from proprietary VLMs to achieve good performance, effectively distilling these closed models into open ones. As a result, the community is still missing foundational knowledge about how to build performant VLMs from scratch. We present Molmo, a new family of VLMs that are state-of-the-art in their class of openness. Our key innovation is a novel, highly detailed image caption dataset collected entirely from human annotators using speech-based descriptions. To enable a wide array of user interactions, we also introduce a diverse dataset mixture for fine-tuning that includes in-the-wild Q&A and innovative 2D pointing data. The success of our approach relies on careful choices for the model architecture details, a well-tuned training pipeline, and, most critically, the quality of our newly collected datasets, all of which will be released. The best-in-class 72B model within the Molmo family not only outperforms others in the class of open weight and data models but also compares favorably against proprietary systems like GPT-4o, Claude 3.5, and Gemini 1.5 on both academic benchmarks and human evaluation. We will be releasing all of our model weights, captioning and fine-tuning data, and source code in the near future. Select model weights, inference code, and demo are available at https://molmo.allenai.org.