Abstract
Multimodal Audio-Language Models (ALMs) can understand and reason over both audio and text. Typically, reasoning performance correlates with model size, with the best results achieved by models exceeding 8 billion parameters. However, no prior work has explored enabling small audio-language models to perform reasoning tasks, despite the potential applications for edge devices. To address this gap, we introduce Mellow, a small Audio-Language Model specifically designed for reasoning. Mellow achieves state-of-the-art performance among existing small audio-language models and surpasses several larger models in reasoning capabilities. For instance, Mellow scores 52.11 on MMAU, comparable to SoTA Qwen2 Audio (which scores 52.5), while using 50 times fewer parameters and being trained on 60 times less data (measured in audio hours). To train Mellow, we introduce ReasonAQA, a dataset designed to enhance audio-grounded reasoning in models. It consists of a mixture of existing datasets (30% of the data) and synthetically generated data (70%). The synthetic dataset is derived from audio captioning datasets, where Large Language Models (LLMs) generate detailed and multiple-choice questions focusing on audio events, objects, acoustic scenes, signal properties, semantics, and listener emotions. To evaluate Mellow's reasoning ability, we benchmark it on a diverse set of tasks, including audio understanding, deductive reasoning, and comparative reasoning, assessing it on both in-distribution and out-of-distribution data. Finally, we conduct extensive ablation studies to explore the impact of projection layer choices, synthetic data generation methods, and language model pretraining on reasoning performance. Our training dataset, findings, and baseline pave the way for developing small ALMs capable of reasoning.