Audio Retrieval with WavText5K and CLAP Training

Abstract

Audio-Text retrieval takes a natural language query to retrieve relevant audio files in a database. Conversely, Text-Audio retrieval takes an audio file as a query to retrieve relevant natural language descriptions. Most of the literature trains retrieval systems with one audio captioning dataset, and the benefit of training with multiple datasets is underexplored. Moreover, retrieval systems have to learn the alignment between elaborate sentences and the audio content they describe, which varies in length from a few seconds to several minutes. In this work, we propose a new collection of web audio-text pairs and a new framework for retrieval. First, we provide a new collection of about five thousand web audio-text pairs that we refer to as WavText5K. When used to train our retrieval system, WavText5K improved performance more than other audio captioning datasets. Second, our framework learns to connect language and audio content by using a text encoder, two audio encoders, and a contrastive learning objective. Combining both audio encoders helps to process variable-length audio. Together, the two contributions exceed state-of-the-art performance on AudioCaps and Clotho by a relative 2% and 16% on Text-Audio retrieval, and by 6% and 23% on Audio-Text retrieval.
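To make the contrastive objective concrete, the following is a minimal NumPy sketch of a symmetric audio-text contrastive loss over a batch of paired embeddings. It is an illustration under stated assumptions, not the paper's implementation: the abstract does not say how the two audio encoders are combined, so the concatenate-and-project fusion in `combine_audio`, the temperature value, the embedding sizes, and all function names are hypothetical, and random vectors stand in for real encoder outputs.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=axis, keepdims=True), eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def combine_audio(emb_a, emb_b, proj):
    """Hypothetical fusion of two audio encoders: concatenate, then project.

    The abstract only says both encoders are combined; this particular
    fusion is an assumption made for illustration.
    """
    return np.concatenate([emb_a, emb_b], axis=-1) @ proj


def symmetric_contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """Cross-entropy over the similarity matrix in both retrieval directions.

    Row i of audio_emb is assumed to be paired with row i of text_emb, so the
    diagonal of the similarity matrix holds the positive pairs.
    """
    a = l2_normalize(audio_emb)
    t = l2_normalize(text_emb)
    logits = (a @ t.T) / temperature  # shape: (batch, batch)
    idx = np.arange(logits.shape[0])
    loss_audio_to_text = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_text_to_audio = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_audio_to_text + loss_text_to_audio)


# Random stand-ins for encoder outputs (assumed sizes, not from the paper).
rng = np.random.default_rng(0)
batch, d_a, d_b, d = 4, 8, 6, 5
audio_a = rng.normal(size=(batch, d_a))  # output of the first audio encoder
audio_b = rng.normal(size=(batch, d_b))  # output of the second audio encoder
proj = rng.normal(size=(d_a + d_b, d))   # hypothetical projection into the joint space
text = rng.normal(size=(batch, d))       # output of the text encoder

audio = combine_audio(audio_a, audio_b, proj)
loss = symmetric_contrastive_loss(audio, text)
print(f"contrastive loss on random embeddings: {loss:.4f}")
```

Minimizing a loss of this shape pulls each audio clip toward its own description and away from the other descriptions in the batch, which is what lets the same similarity matrix serve both retrieval directions at query time.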
