Abstract
Retrieval-augmented generation can improve audio captioning by incorporating relevant audio-text pairs from a knowledge base. Existing methods typically rely solely on the input audio as a unimodal retrieval query. In contrast, we propose Generation-Assisted Multimodal Querying, which generates a text description of the input audio to enable multimodal querying. This approach aligns the query modality with the audio-text structure of the knowledge base, leading to more effective retrieval. Furthermore, we introduce a novel progressive learning strategy that gradually increases the number of interleaved audio-text pairs to enhance the training process. Our experiments on AudioCaps, Clotho, and Auto-ACD demonstrate that our approach achieves state-of-the-art results across these benchmarks.
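To make the two ideas above concrete, the following is a minimal illustrative sketch, not the implementation used in this work. It assumes that audio and text embeddings are already available as plain vectors, that the multimodal query is scored by a weighted sum of audio-audio and text-text cosine similarity (the weight `alpha` is a hypothetical parameter), and that the progressive strategy can be approximated by a simple step schedule. The function names `retrieve` and `pairs_for_epoch` and all of their parameters are assumptions introduced for illustration only.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query_audio, query_text, knowledge_base, k=2, alpha=0.5):
    """Return the captions of the top-k knowledge-base entries.

    Each entry is (audio_embedding, text_embedding, caption). The score is an
    assumed fusion rule: alpha * audio similarity + (1 - alpha) * text similarity.
    query_text stands in for the embedding of a generated draft caption.
    """
    scored = []
    for audio_emb, text_emb, caption in knowledge_base:
        score = (alpha * cosine(query_audio, audio_emb)
                 + (1 - alpha) * cosine(query_text, text_emb))
        scored.append((score, caption))
    scored.sort(key=lambda item: item[0], reverse=True)
    return [caption for _, caption in scored[:k]]


def pairs_for_epoch(epoch, max_pairs=4, epochs_per_step=2):
    """Hypothetical schedule: one more interleaved pair every epochs_per_step epochs."""
    return min(max_pairs, 1 + epoch // epochs_per_step)


# Toy knowledge base with 2-dimensional embeddings (made-up values).
toy_kb = [
    ([1.0, 0.0], [1.0, 0.0], "dog barking"),
    ([0.0, 1.0], [0.0, 1.0], "rain falling"),
    ([1.0, 1.0], [1.0, 0.0], "dog barking in rain"),
]
top_captions = retrieve([1.0, 0.1], [1.0, 0.0], toy_kb, k=2)
```

In this toy setting the text half of the query raises the rank of entries whose captions match the draft description, even when their audio embeddings are only a partial match. Setting `alpha=1.0` reduces the sketch to audio-only (unimodal) querying.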