SARI: Structured Audio Reasoning via Curriculum-Guided Reinforcement Learning

Abstract

Recent work shows that reinforcement learning (RL) can markedly sharpen the reasoning ability of large language models (LLMs) by prompting them to "think before answering." Yet whether and how these gains transfer to audio-language reasoning remains largely unexplored. We extend the Group-Relative Policy Optimization (GRPO) framework from DeepSeek-R1 to a Large Audio-Language Model (LALM) and construct a 32k-sample multiple-choice corpus. Using a two-stage regimen, supervised fine-tuning (SFT) on structured and unstructured chains-of-thought followed by curriculum-guided GRPO, we systematically compare implicit vs. explicit and structured vs. free-form reasoning under identical architectures. Our structured audio reasoning model, SARI (Structured Audio Reasoning via Curriculum-Guided Reinforcement Learning), achieves a 16.35% improvement in average accuracy over the base model Qwen2-Audio-7B-Instruct. Furthermore, the variant built upon Qwen2.5-Omni reaches state-of-the-art performance of 67.08% on the MMAU test-mini benchmark. Ablation experiments show that, on the base model we use, (i) SFT warm-up is important for stable RL training, (ii) structured chains yield more robust generalization than unstructured ones, and (iii) easy-to-hard curricula accelerate convergence and improve final performance. These findings demonstrate that explicit, structured reasoning and curriculum learning substantially enhance audio-language understanding.
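
As a rough illustration of the two ingredients named in the abstract, the sketch below shows a group-relative advantage computation in the style of GRPO and an easy-to-hard ordering of training samples. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the `eps` constant, the use of the population standard deviation, and the per-sample difficulty score are all hypothetical, and the abstract does not specify the reward design or how difficulty is measured.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the other responses sampled for the same prompt.

    Assumption: advantage = (reward - group mean) / (group std + eps).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the sampled group
    return [(r - mu) / (sigma + eps) for r in rewards]


def easy_to_hard(samples, difficulty):
    """Order samples by a (hypothetical) difficulty score, easiest first."""
    return sorted(samples, key=difficulty)


# Four sampled answers to one multiple-choice question: 1.0 = correct, 0.0 = wrong.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# Hypothetical difficulty scores, e.g. the base model's error rate per question.
scores = {"q1": 0.9, "q2": 0.1, "q3": 0.5}
ordered = easy_to_hard(list(scores), scores.get)
```

In this sketch, correct answers receive positive advantages and wrong answers negative ones relative to their own group, and a group whose answers all earn the same reward yields zero advantage for every response.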
