Abstract
Visual events are usually accompanied by sounds in our daily lives. We pose the question: can a machine learn the correspondence between a visual scene and its sound, and localize the sound source only by observing pairs of sounds and visual scenes, as humans do? In this paper, we propose a novel unsupervised algorithm to address the problem of localizing the sound source in visual scenes. We develop a two-stream network, in which each stream handles one modality, with an attention mechanism for sound source localization. Moreover, although our network is formulated within the unsupervised learning framework, it can be extended to a unified architecture with a simple modification for the supervised and semi-supervised learning settings as well. We also develop a new sound source dataset for performance evaluation. Our empirical evaluation shows that the unsupervised method reaches false conclusions in some cases. We show that even with a small amount of supervision, these false conclusions can be corrected and the source of sound in a visual scene can be localized effectively.