Abstract
Massive multi-modality datasets play a significant role in the success of large video-language models. However, current video-language datasets primarily provide text descriptions for visual frames and treat audio as weakly related information. They usually overlook the inherent audio-visual correlation, which leads to monotonous annotation within each modality rather than comprehensive and precise descriptions. This omission makes many cross-modality studies difficult. To fill this gap, we present MMTrail, a large-scale multi-modality video-language dataset incorporating more than 20M trailer clips with visual captions, and 2M high-quality clips with multimodal captions. Trailers preview full-length video works and integrate context, visual frames, and background music. In particular, trailers have two main advantages: (1) the topics are diverse, and the content is of various types, e.g., film, news, and gaming; (2) the corresponding background music is custom-designed, making it more coherent with the visual context. Building on these insights, we propose a systematic captioning framework that produces annotations across multiple modalities for more than 27.1k hours of trailer videos. To ensure that the caption retains the music perspective while preserving the authority of the visual context, we leverage an advanced LLM to merge all annotations adaptively. In this fashion, our MMTrail dataset potentially paves the way for fine-grained large multimodal-language model training. In experiments, we provide evaluation metrics and benchmark results on our dataset, demonstrating the high quality of our annotation and its effectiveness for model training.