Localizing Visual Sounds the Hard Way

Abstract

The objective of this work is to localize sound sources that are visible in a video without using manual annotations. Our key technical contribution is to show that, by training the network to explicitly discriminate challenging image fragments, even for images that do contain the object emitting the sound, we can significantly boost the localization performance. We do so elegantly by introducing a mechanism to mine hard samples and add them to a contrastive learning formulation automatically. We show that our algorithm achieves state-of-the-art performance on the popular Flickr SoundNet dataset. Furthermore, we introduce the VGG-Sound Source (VGG-SS) benchmark, a new set of annotations for the recently-introduced VGG-Sound dataset, where the sound sources visible in each video clip are explicitly marked with bounding box annotations. This dataset is 20 times larger than analogous existing ones, contains 5K videos spanning over 200 categories, and, differently from Flickr SoundNet, is video-based. On VGG-SS, we also show that our algorithm achieves state-of-the-art performance against several baselines.
