Enhancing Retrieval-Augmented Audio Captioning with Generation-Assisted Multimodal Querying and Progressive Learning

  • 2025-06-10 14:37:48
  • Choi Changin, Lim Sungjun, Rhee Wonjong

Abstract

Retrieval-augmented generation can improve audio captioning by incorporating relevant audio-text pairs from a knowledge base. Existing methods typically rely solely on the input audio as a unimodal retrieval query. In contrast, we propose Generation-Assisted Multimodal Querying, which generates a text description of the input audio to enable multimodal querying. This approach aligns the query modality with the audio-text structure of the knowledge base, leading to more effective retrieval. Furthermore, we introduce a novel progressive learning strategy that gradually increases the number of interleaved audio-text pairs to enhance the training process. Our experiments on AudioCaps, Clotho, and Auto-ACD demonstrate that our approach achieves state-of-the-art results across these benchmarks.
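
The two ideas in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how audio and text similarities are combined, which encoders are used, or what schedule the progressive strategy follows. The sketch assumes precomputed embedding vectors, a weighted sum of cosine similarities with a hypothetical weight `alpha`, and a linear schedule; the names `multimodal_scores`, `retrieve_top_k`, and `pairs_for_epoch` are illustrative only.

```python
import math


def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def multimodal_scores(query_audio, query_text, kb, alpha=0.5):
    """Score every knowledge-base pair against an (audio, generated-text) query.

    Assumption: the score is alpha * cos(audio, audio) + (1 - alpha) * cos(text, text).
    alpha is a hypothetical mixing weight, not a value from the paper.
    """
    qa, qt = normalize(query_audio), normalize(query_text)
    scores = []
    for entry in kb:
        s_audio = dot(qa, normalize(entry["audio"]))
        s_text = dot(qt, normalize(entry["text"]))
        scores.append(alpha * s_audio + (1 - alpha) * s_text)
    return scores


def retrieve_top_k(query_audio, query_text, kb, k, alpha=0.5):
    """Return indices of the k highest-scoring knowledge-base pairs."""
    scores = multimodal_scores(query_audio, query_text, kb, alpha)
    order = sorted(range(len(kb)), key=lambda i: scores[i], reverse=True)
    return order[:k]


def pairs_for_epoch(epoch, total_epochs, max_pairs):
    """Assumed linear schedule: grow the number of interleaved pairs from 0 to max_pairs."""
    if total_epochs <= 1:
        return max_pairs
    return (max_pairs * epoch) // (total_epochs - 1)


# Toy knowledge base of audio-text pairs with 2-dimensional embeddings.
kb = [
    {"audio": [1.0, 0.0], "text": [0.0, 1.0]},  # matches the query in both modalities
    {"audio": [1.0, 0.0], "text": [1.0, 0.0]},  # matches the audio only
    {"audio": [0.0, 1.0], "text": [1.0, 0.0]},  # matches neither
]
top = retrieve_top_k([1.0, 0.0], [0.0, 1.0], kb, k=2)
schedule = [pairs_for_epoch(e, total_epochs=5, max_pairs=4) for e in range(5)]
```

In this toy setup the text embedding of the generated description separates the first two entries, which an audio-only query would score identically. In practice the query text would come from a captioning model run on the input audio.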

 
