Audio-Sync Video Generation with Multi-Stream Temporal Control

Abstract

Audio is inherently temporal and closely synchronized with the visual world, making it a naturally aligned and expressive control signal for controllable video generation (e.g., movies). Beyond control, directly translating audio into video is essential for understanding and visualizing rich audio narratives (e.g., podcasts or historical recordings). However, existing approaches fall short in generating high-quality videos with precise audio-visual synchronization, especially across diverse and complex audio types. In this work, we introduce MTV, a versatile framework for audio-sync video generation. MTV explicitly separates audio into speech, effects, and music tracks, enabling disentangled control over lip motion, event timing, and visual mood, respectively, resulting in fine-grained and semantically aligned video generation. To support the framework, we additionally present DEMIX, a dataset comprising high-quality cinematic videos and demixed audio tracks. DEMIX is structured into five overlapping subsets, enabling scalable multi-stage training for diverse generation scenarios. Extensive experiments demonstrate that MTV achieves state-of-the-art performance across six standard metrics spanning video quality, text-video consistency, and audio-video alignment. Project page: https://hjzheng.net/projects/MTV/.
