Abstract
Current language models fall short in understanding aspects of the world not easily described in words, and struggle with complex, long-form tasks. Video sequences offer valuable temporal information absent in language and static images, making them attractive for joint modeling with language. Such models could develop an understanding of both human textual knowledge and the physical world, enabling broader AI capabilities for assisting humans. However, learning from millions of tokens of video and language sequences poses challenges due to memory constraints, computational complexity, and limited datasets. To address these challenges, we curate a large dataset of diverse videos and books, utilize the Blockwise RingAttention technique to scalably train on long sequences, and gradually increase context size from 4K to 1M tokens. This paper makes the following contributions: (a) Largest context size neural network: We train one of the largest context size transformers on long video and language sequences, setting new benchmarks in difficult retrieval tasks and long video understanding. (b) Solutions for overcoming vision-language training challenges, including masked sequence packing for mixing different sequence lengths, loss weighting to balance language and vision, and a model-generated QA dataset for long-sequence chat. (c) A highly optimized implementation with RingAttention, Blockwise Transformers, masked sequence packing, and other key features for training on multimodal sequences millions of tokens long. (d) A fully open-sourced family of 7B-parameter models capable of processing long text documents (LWM-Text, LWM-Text-Chat) and videos (LWM, LWM-Chat) of over 1M tokens. This work paves the way for training on massive datasets of long video and language to develop an understanding of both human knowledge and the multimodal world, and broader capabilities.
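
To make the masked sequence packing named in contribution (b) concrete, the following is a minimal sketch of one common way such packing can be realized: several short sequences are concatenated into one fixed-length row, and a segment-aware causal mask keeps each token from attending across sequence boundaries. The abstract does not give the implementation details, so everything here is an assumption for illustration: the function names `pack_sequences` and `packed_causal_mask`, the convention that segment id 0 marks padding, and the use of plain Python lists rather than the accelerator arrays a real training pipeline would use.

```python
def pack_sequences(sequences, max_len, pad_id=0):
    """Concatenate sequences into one row of length max_len.

    Returns (tokens, segment_ids). Segment ids start at 1; 0 marks padding.
    (Hypothetical helper; this convention is an assumption, not the paper's.)
    """
    tokens, segment_ids = [], []
    for seg, seq in enumerate(sequences, start=1):
        tokens.extend(seq)
        segment_ids.extend([seg] * len(seq))
    if len(tokens) > max_len:
        raise ValueError("sequences do not fit in max_len")
    pad = max_len - len(tokens)
    return tokens + [pad_id] * pad, segment_ids + [0] * pad


def packed_causal_mask(segment_ids):
    """Build mask[i][j], True when query position i may attend to key position j.

    A position attends only to earlier-or-equal positions in its own segment;
    padding positions attend to nothing.
    """
    n = len(segment_ids)
    return [
        [
            j <= i and segment_ids[i] != 0 and segment_ids[i] == segment_ids[j]
            for j in range(n)
        ]
        for i in range(n)
    ]


# Two sequences of different lengths packed into a single row of 6 tokens.
tokens, segment_ids = pack_sequences([[11, 12, 13], [21, 22]], max_len=6)
mask = packed_causal_mask(segment_ids)
```

In this sketch the first token of the second sequence (position 3) can attend only to itself, so packing does not leak context between unrelated examples. The same `segment_ids` could also be used to zero out the loss on padding positions or to apply per-sequence loss weights, although the weighting scheme itself is not specified in this abstract.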