Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models

  • 2024-09-25 18:59:51
  • Matt Deitke, Christopher Clark, Sangho Lee, Rohun Tripathi, Yue Yang, Jae Sung Park, Mohammadreza Salehi, Niklas Muennighoff, Kyle Lo, Luca Soldaini, Jiasen Lu, Taira Anderson, Erin Bransom, Kiana Ehsani, Huong Ngo, YenSung Chen, Ajay Patel, Mark Yatskar, Chris Callison-Burch, Andrew Head, Rose Hendrix, Favyen Bastani, Eli VanderBilt, Nathan Lambert, Yvonne Chou, Arnavi Chheda, Jenna Sparks, Sam Skjonsberg, Michael Schmitz, Aaron Sarnat, Byron Bischoff, Pete Walsh, Chris Newell, Piper Wolters, Tanmay Gupta, Kuo-Hao Zeng, Jon Borchardt, Dirk Groeneveld, Jen Dumas, Crystal Nam, Sophie Lebrecht, Caitlin Wittlif, Carissa Schoenick, Oscar Michel, Ranjay Krishna, Luca Weihs, Noah A. Smith, Hannaneh Hajishirzi, Ross Girshick, Ali Farhadi, Aniruddha Kembhavi

Abstract

Today's most advanced multimodal models remain proprietary. The strongest open-weight models rely heavily on synthetic data from proprietary VLMs to achieve good performance, effectively distilling these closed models into open ones. As a result, the community is still missing foundational knowledge about how to build performant VLMs from scratch. We present Molmo, a new family of VLMs that are state-of-the-art in their class of openness. Our key innovation is a novel, highly detailed image caption dataset collected entirely from human annotators using speech-based descriptions. To enable a wide array of user interactions, we also introduce a diverse dataset mixture for fine-tuning that includes in-the-wild Q&A and innovative 2D pointing data. The success of our approach relies on careful choices for the model architecture details, a well-tuned training pipeline, and, most critically, the quality of our newly collected datasets, all of which will be released. The best-in-class 72B model within the Molmo family not only outperforms others in the class of open weight and data models but also compares favorably against proprietary systems like GPT-4o, Claude 3.5, and Gemini 1.5 on both academic benchmarks and human evaluation. We will be releasing all of our model weights, captioning and fine-tuning data, and source code in the near future. Select model weights, inference code, and demo are available at https://molmo.allenai.org.

 
