Abstract
This study introduces CLASP (Contrastive Language-Speech Pretraining), a multilingual, multimodal representation tailored for audio-text information retrieval. CLASP leverages the synergy between spoken content and textual data. During training, we utilize our newly introduced speech-text dataset, which encompasses 15 diverse categories ranging from fiction to religion. CLASP's audio component integrates audio spectrograms with a pre-trained self-supervised speech model, while its language encoding counterpart employs a sentence encoder pre-trained on over 100 languages. This unified lightweight model bridges the gap between various modalities and languages, enhancing its effectiveness in handling and retrieving multilingual and multimodal data. Our evaluations across multiple languages demonstrate that CLASP establishes new benchmarks in HITS@1, MRR, and meanR metrics, outperforming traditional ASR-based retrieval methods that rely on transcribing speech into text for subsequent text retrieval, especially in specific scenarios.
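For readers unfamiliar with the three retrieval metrics named above, the following is a minimal sketch of how they are commonly computed. The abstract does not define them, so this assumes the standard definitions: HITS@1 is the fraction of queries whose correct item is ranked first, MRR is the mean reciprocal rank of the correct item, and meanR is its mean rank. The function name `retrieval_metrics`, the square similarity-matrix layout (query i matches item i), and the tie-handling rule are illustrative assumptions, not details taken from the paper.

```python
from typing import Dict, Sequence


def retrieval_metrics(sim: Sequence[Sequence[float]]) -> Dict[str, float]:
    """Compute HITS@1, MRR, and mean rank from a square similarity matrix.

    Assumption: sim[i][j] is the similarity between query i (e.g. a text
    embedding) and candidate j (e.g. a speech embedding), and candidate i
    is the single correct match for query i.
    """
    ranks = []
    for i, row in enumerate(sim):
        correct_score = row[i]
        # Rank = 1 + number of candidates scored strictly higher than the
        # correct one (ties are resolved in favour of the correct item).
        rank = 1 + sum(1 for score in row if score > correct_score)
        ranks.append(rank)

    n = len(ranks)
    return {
        "hits@1": sum(1 for r in ranks if r == 1) / n,  # higher is better
        "mrr": sum(1.0 / r for r in ranks) / n,         # higher is better
        "meanr": sum(ranks) / n,                        # lower is better
    }


# Toy example: three queries, three candidates.
example = [
    [0.9, 0.1, 0.2],  # correct item ranked 1st
    [0.8, 0.3, 0.5],  # correct item ranked 3rd
    [0.1, 0.2, 0.7],  # correct item ranked 1st
]
metrics = retrieval_metrics(example)
```

Under these assumptions, higher HITS@1 and MRR and lower meanR indicate better retrieval, which is the sense in which the comparison against ASR-based pipelines should be read.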