Synchformer: Efficient Synchronization from Sparse Cues

  • 2024-01-29 18:59:55
  • Vladimir Iashin, Weidi Xie, Esa Rahtu, Andrew Zisserman
Our objective is audio-visual synchronization with a focus on 'in-the-wild' videos, such as those on YouTube, where synchronization cues can be sparse. Our contributions include a novel audio-visual synchronization model, and training that decouples feature extraction from synchronization modelling through multi-modal segment-level contrastive pre-training. This approach achieves state-of-the-art performance in both dense and sparse settings. We also extend synchronization model training to AudioSet, a million-scale 'in-the-wild' dataset, investigate evidence attribution techniques for interpretability, and explore a new capability for synchronization models: audio-visual synchronizability.


