Learning Spatially-Aware Language and Audio Embedding

Abstract

Humans can picture a sound scene given an imprecise natural language description. For example, it is easy to imagine an acoustic environment given a phrase like "the lion roar came from right behind me!". For a machine to have the same degree of comprehension, it must know what a lion is (semantic attribute), what the concept of "behind" is (spatial attribute), and how these pieces of linguistic information align with the semantic and spatial attributes of the sound (what a roar sounds like when it's coming from behind). State-of-the-art audio foundation models, which learn to map between audio scenes and natural textual descriptions, are trained on non-spatial audio and text pairs, and hence lack spatial awareness. In contrast, sound event localization and detection models are limited to recognizing sounds from a fixed number of classes, and they localize the source to an absolute position (e.g., 0.2 m) rather than a position described using natural language (e.g., "next to me"). To address these gaps, we present ELSA, a spatially-aware audio and text embedding model trained using multimodal contrastive learning. ELSA supports non-spatial audio, spatial audio, and open-vocabulary text captions describing both the spatial and semantic components of sound. To train ELSA: (a) we spatially augment the audio and captions of three open-source audio datasets totaling 4,738 hours of audio, and (b) we design an encoder to capture the semantics of non-spatial audio, and the semantics and spatial attributes of spatial audio, using contrastive learning. ELSA is competitive with the state of the art for both semantic retrieval and 3D source localization. In particular, ELSA achieves +2.8% mean audio-to-text and text-to-audio R@1 above the baseline, and outperforms the baseline by -11.6° mean absolute error in 3D source localization.
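
To make the retrieval metric concrete, here is a minimal sketch of how a mean audio-to-text and text-to-audio R@1 could be computed from a similarity matrix. It assumes a one-to-one pairing in which audio clip i matches caption i; the function names and the toy scores are hypothetical illustrations, not the paper's evaluation code or data.

```python
from typing import List, Sequence


def recall_at_1(similarity: Sequence[Sequence[float]]) -> float:
    """Fraction of queries (rows) whose top-scoring candidate (column)
    is the paired item. Assumes query i is paired with candidate i."""
    hits = 0
    for i, row in enumerate(similarity):
        best = max(range(len(row)), key=lambda j: row[j])
        if best == i:
            hits += 1
    return hits / len(similarity)


def transpose(matrix: Sequence[Sequence[float]]) -> List[List[float]]:
    """Swap rows and columns so captions become the queries."""
    return [list(col) for col in zip(*matrix)]


def mean_bidirectional_r1(sim_audio_text: Sequence[Sequence[float]]) -> float:
    """Average of audio-to-text and text-to-audio R@1."""
    a2t = recall_at_1(sim_audio_text)
    t2a = recall_at_1(transpose(sim_audio_text))
    return (a2t + t2a) / 2


# Toy similarity scores (made up): rows are audio clips, columns are captions.
sim = [
    [0.9, 0.1, 0.2],
    [0.3, 0.8, 0.4],
    [0.7, 0.2, 0.6],
]
print(mean_bidirectional_r1(sim))
```

In this toy matrix the third audio clip ranks the wrong caption first, so audio-to-text R@1 is 2/3, while every caption retrieves its own clip, so text-to-audio R@1 is 1. Benchmarks with several captions per clip need a more careful definition of a "hit" than this one-to-one sketch uses.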
