Abstract
In light of recent advances in multimodal Large Language Models (LLMs), there is increasing attention to scaling them from image-text data to more informative real-world videos. Compared to static images, video poses unique challenges for effective large-scale pre-training due to the modeling of its spatiotemporal dynamics. In this paper, we address such limitations in video-language pre-training with an efficient video decomposition that represents each video as keyframes and temporal motions. These are then adapted to an LLM using well-designed tokenizers that discretize visual and temporal information as a few tokens, thus enabling unified generative pre-training of videos, images, and text. At inference, the generated tokens from the LLM are carefully recovered to the original continuous pixel space to create various video content. Our proposed framework is capable of both comprehending and generating image and video content, as demonstrated by its competitive performance across 13 multimodal benchmarks in image and video understanding and generation. Our code and models will be available at https://video-lavit.github.io.