VMAS: Video-to-Music Generation via Semantic Alignment in Web Music Videos

Abstract

We present a framework for learning to generate background music from video inputs. Unlike existing works that rely on symbolic musical annotations, which are limited in quantity and diversity, our method leverages large-scale web videos accompanied by background music. This enables our model to learn to generate realistic and diverse music. To accomplish this goal, we develop a generative video-music Transformer with a novel semantic video-music alignment scheme. Our model uses a joint autoregressive and contrastive learning objective, which encourages the generation of music aligned with high-level video content. We also introduce a novel video-beat alignment scheme to match the generated music beats with the low-level motions in the video. Lastly, to capture fine-grained visual cues in a video needed for realistic background music generation, we introduce a new temporal video encoder architecture, allowing us to efficiently process videos consisting of many densely sampled frames. We train our framework on our newly curated DISCO-MV dataset, consisting of 2.2M video-music samples, which is orders of magnitude larger than any prior datasets used for video music generation. Our method outperforms existing approaches on the DISCO-MV and MusicCaps datasets according to various music generation evaluation metrics, including human evaluation. Results are available at https://genjib.github.io/project_page/VMAs/index.html
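
To make the "joint autoregressive and contrastive learning objective" more concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes the autoregressive term is a next-token cross-entropy over discrete music tokens and the contrastive term is a symmetric InfoNCE loss over paired video and music embeddings. The abstract does not specify either form, and the function names, the temperature of 0.07, and the weight parameter are all hypothetical.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax over a list of scores."""
    peak = max(row)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in row))
    return [x - log_norm for x in row]


def autoregressive_loss(logits, targets):
    """Mean next-token cross-entropy.

    logits: list of T score vectors (one per step, length = vocabulary size).
    targets: list of T ground-truth music token ids.
    """
    total = sum(log_softmax(step)[tok] for step, tok in zip(logits, targets))
    return -total / len(targets)


def normalize(vec):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def contrastive_loss(video_embs, music_embs, temperature=0.07):
    """Symmetric InfoNCE (assumed form): pair i of video and music is the positive."""
    v = [normalize(e) for e in video_embs]
    m = [normalize(e) for e in music_embs]
    n = len(v)
    # Cosine similarity matrix scaled by the (assumed) temperature.
    sim = [[sum(a * b for a, b in zip(v[i], m[j])) / temperature
            for j in range(n)] for i in range(n)]
    video_to_music = -sum(log_softmax(sim[i])[i] for i in range(n)) / n
    music_to_video = -sum(
        log_softmax([sim[i][j] for i in range(n)])[j] for j in range(n)
    ) / n
    return 0.5 * (video_to_music + music_to_video)


def joint_loss(logits, targets, video_embs, music_embs, weight=1.0):
    """Weighted sum of both terms; the weighting scheme is an assumption."""
    return (autoregressive_loss(logits, targets)
            + weight * contrastive_loss(video_embs, music_embs))
```

Under these assumptions, the contrastive term is small when each video embedding is most similar to its own music embedding and large when pairs are mismatched, which is the sense in which it would encourage music aligned with high-level video content.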
