Abstract
Our objective is audio-visual synchronization with a focus on 'in-the-wild' videos, such as those on YouTube, where synchronization cues can be sparse. Our contributions include a novel audio-visual synchronization model, and training that decouples feature extraction from synchronization modelling through multi-modal segment-level contrastive pre-training. This approach achieves state-of-the-art performance in both dense and sparse settings. We also extend synchronization model training to AudioSet, a million-scale 'in-the-wild' dataset, investigate evidence attribution techniques for interpretability, and explore a new capability for synchronization models: audio-visual synchronizability.