Self-supervised object detection from audio-visual correspondence

Abstract

We tackle the problem of learning object detectors without supervision. Unlike weakly-supervised object detection, we do not assume image-level class labels. Instead, we extract a supervisory signal from audio-visual data, using the audio component to "teach" the object detector. While this problem is related to sound source localisation, it is considerably harder because the detector must classify the objects by type, enumerate each instance of the object, and do so even when the object is silent. We tackle this problem by first designing a self-supervised framework with a contrastive objective that jointly learns to classify and localise objects. Then, without using any supervision, we simply use these self-supervised labels and boxes to train an image-based object detector. With this, we outperform previous unsupervised and weakly-supervised detectors on the tasks of object detection and sound source localisation. We also show that we can align this detector to ground-truth classes with as little as one label per pseudo-class, and show how our method can learn to detect generic objects that go beyond instruments, such as airplanes and cats.
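
The abstract does not spell out the contrastive objective, so the following is only a minimal sketch of a generic audio-visual contrastive (InfoNCE-style) loss, not the authors' formulation. It assumes each clip yields one audio embedding and one visual embedding, that matching pairs share an index, and that the other clips in the batch act as negatives. The names `info_nce`, `l2_normalize`, and the `temperature` value are hypothetical choices made for illustration.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v] if norm > 0 else list(v)


def info_nce(audio, visual, temperature=0.1):
    """Audio-to-visual InfoNCE loss over a batch of paired embeddings.

    audio[i] and visual[i] are assumed to come from the same clip (positive
    pair); visual[j] for j != i are treated as negatives for audio[i].
    This is a generic contrastive loss, assumed here for illustration only.
    """
    a = [l2_normalize(x) for x in audio]
    v = [l2_normalize(x) for x in visual]
    n = len(a)
    total = 0.0
    for i in range(n):
        # Cosine similarity of audio i to every visual embedding, scaled.
        logits = [dot(a[i], v[j]) / temperature for j in range(n)]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        # Cross-entropy with the matching index as the target.
        total += log_denom - logits[i]
    return total / n


if __name__ == "__main__":
    audio = [[1.0, 0.0], [0.0, 1.0]]
    aligned = [[1.0, 0.0], [0.0, 1.0]]
    swapped = [[0.0, 1.0], [1.0, 0.0]]
    print("aligned pairs:", info_nce(audio, aligned))
    print("swapped pairs:", info_nce(audio, swapped))
```

In this toy setting the loss is small when each audio embedding points the same way as its paired visual embedding and large when the pairs are swapped, which is the kind of signal that lets audio "teach" a visual model without labels. A full system along the lines the abstract describes would additionally need to produce class assignments and boxes, which this sketch does not attempt.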
