Soundini: Sound-Guided Diffusion for Natural Video Editing

Abstract

We propose a method for adding sound-guided visual effects to specific regions of videos in a zero-shot setting. Animating the appearance of the visual effect is challenging because each frame of the edited video should have visual changes while maintaining temporal consistency. Moreover, existing video editing solutions focus on temporal consistency across frames, ignoring the visual style variations over time, e.g., thunderstorm, wave, fire crackling. To overcome this limitation, we utilize temporal sound features for the dynamic style. Specifically, we guide denoising diffusion probabilistic models with an audio latent representation in the audio-visual latent space. To the best of our knowledge, our work is the first to explore sound-guided natural video editing from various sound sources with sound-specialized properties, such as intensity, timbre, and volume. Additionally, we design optical flow-based guidance to generate temporally consistent video frames, capturing the pixel-wise relationship between adjacent frames. Experimental results show that our method outperforms existing video editing techniques, producing more realistic visual effects that reflect the properties of sound. Please visit our page: https://kuai-lab.github.io/soundini-gallery/.
