AudioSetMix: Enhancing Audio-Language Datasets with LLM-Assisted Augmentations

Abstract

Multi-modal learning in the audio-language domain has seen significant advancements in recent years. However, audio-language learning faces challenges due to limited and lower-quality data compared to image-language tasks. Existing audio-language datasets are notably smaller, and manual labeling is hindered by the need to listen to entire audio clips for accurate labeling. Our method systematically generates audio-caption pairs by augmenting audio clips with natural language labels and corresponding audio signal processing operations. Leveraging a Large Language Model, we generate descriptions of augmented audio clips with a prompt template. This scalable method produces AudioSetMix, a high-quality training dataset for text-and-audio-related models. Integrating our dataset improves model performance on benchmarks by providing more diverse and better-aligned examples. Notably, our dataset addresses the absence of modifiers (adjectives and adverbs) in existing datasets. By enabling models to learn these concepts, and by generating hard negative examples during training, we achieve state-of-the-art performance on multiple benchmarks.
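
The pipeline described in the abstract (apply signal processing operations to labeled clips, then fill a prompt template for a Large Language Model) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration: the abstract does not name the operations, so volume change and mixing are stand-ins, the function names `change_volume`, `mix`, and `build_prompt` are hypothetical, and the prompt wording is invented. The LLM call itself is omitted.

```python
import math

def change_volume(samples, gain_db):
    """Scale a waveform by a gain given in decibels (assumed operation)."""
    factor = 10 ** (gain_db / 20)
    return [s * factor for s in samples]

def mix(a, b):
    """Overlay two waveforms sample by sample, zero-padding the shorter one (assumed operation)."""
    n = max(len(a), len(b))
    a = a + [0.0] * (n - len(a))
    b = b + [0.0] * (n - len(b))
    return [x + y for x, y in zip(a, b)]

# Hypothetical prompt template; the paper's actual template is not given in the abstract.
PROMPT_TEMPLATE = (
    "Write one natural caption for an audio clip built as follows.\n"
    "Source sounds: {labels}\n"
    "Operations applied: {operations}\n"
)

def build_prompt(labels, operations):
    """Fill the template with the clip labels and the operations applied to them."""
    return PROMPT_TEMPLATE.format(
        labels="; ".join(labels),
        operations="; ".join(operations),
    )

# Two synthetic sine tones stand in for real labeled clips (16 kHz sample rate assumed).
dog = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(16000)]
rain = [math.sin(2 * math.pi * 220 * t / 16000) for t in range(8000)]

quiet_rain = change_volume(rain, -6.0)   # make the second clip quieter
augmented = mix(dog, quiet_rain)         # overlay it on the first clip

prompt = build_prompt(
    ["dog barking", "rain"],
    ["rain volume lowered by 6 dB", "clips mixed together"],
)
```

Because each operation is recorded in words alongside the waveform change, the resulting prompt carries the modifier information (for example, "lowered") that a generated caption would need to express.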
