Audio-Thinker: Guiding Audio Language Model When and How to Think via Reinforcement Learning

Abstract

Recent advancements in large language models, multimodal large language models, and large audio language models (LALMs) have significantly improved their reasoning capabilities through reinforcement learning with rule-based rewards. However, the explicit reasoning process has yet to show significant benefits for audio question answering, and effectively leveraging deep reasoning remains an open challenge, with LALMs still falling short of human-level auditory-language reasoning. To address these limitations, we propose Audio-Thinker, a reinforcement learning framework designed to enhance the reasoning capabilities of LALMs, with a focus on improving adaptability, consistency, and effectiveness. Our approach introduces an adaptive think accuracy reward, enabling the model to dynamically adjust its reasoning strategies based on task complexity. Furthermore, we incorporate an external reward model to evaluate the overall consistency and quality of the reasoning process, complemented by think-based rewards that help the model distinguish between valid and flawed reasoning paths during training. Experimental results demonstrate that our Audio-Thinker model outperforms existing reasoning-oriented LALMs across various benchmark tasks, exhibiting superior reasoning and generalization capabilities.
