2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining

Abstract

Compared to image-text pair data, interleaved corpora enable Vision-Language Models (VLMs) to understand the world more naturally, as humans do. However, existing interleaved datasets are crawled from webpages and suffer from low knowledge density, loose image-text relations, and poor logical coherence between images. On the other hand, the internet hosts a vast number of instructional videos (e.g., online geometry courses) that are widely used by humans to learn foundational subjects, yet these valuable resources remain underexplored in VLM training. In this paper, we introduce a high-quality multimodal textbook corpus with richer foundational knowledge for VLM pretraining. It collects over 2.5 years of instructional videos, totaling 22,000 class hours. We first use an LLM-proposed taxonomy to systematically gather instructional videos. Then we progressively extract and refine visual (keyframes), audio (ASR), and textual (OCR) knowledge from the videos, and organize it as an image-text interleaved corpus based on temporal order. Compared to its counterparts, our video-centric textbook offers more coherent context, richer knowledge, and better image-text alignment. Experiments demonstrate its superb pretraining performance, particularly in knowledge- and reasoning-intensive tasks like ScienceQA and MathVista. Moreover, VLMs pre-trained on our textbook exhibit outstanding interleaved context awareness, leveraging visual and textual cues in their few-shot context for task solving. Our code is available at https://github.com/DAMO-NLP-SG/multimodal_textbook.
