Abstract
The development of video large multimodal models (LMMs) has been hindered by the difficulty of curating large amounts of high-quality raw data from the web. To address this, we propose an alternative approach by creating a high-quality synthetic dataset specifically for video instruction-following, namely LLaVA-Video-178K. This dataset includes key tasks such as detailed captioning, open-ended question-answering (QA), and multiple-choice QA. By training on this dataset, in combination with existing visual instruction tuning data, we introduce LLaVA-Video, a new video LMM. Our experiments demonstrate that LLaVA-Video achieves strong performance across various video benchmarks, highlighting the effectiveness of our dataset. We plan to release the dataset, its generation pipeline, and the model checkpoints.