Abstract
The rapid emergence of diverse large language models (LLMs) has spurred the development of LLM routers that assign user queries to the most suitable model. However, existing LLM routers typically perform a single-round, one-to-one mapping (\textit{i.e.}, assigning each query to a single model in isolation), which limits their capability to tackle complex tasks that demand the complementary strengths of multiple LLMs. In this paper, we present \textbf{Router-R1}, a reinforcement learning (RL)-based framework that formulates multi-LLM routing and aggregation as a sequential decision process. Router-R1 instantiates the router itself as a capable LLM, leveraging its reasoning ability to interleave "think" actions (internal deliberation) with "route" actions (dynamic model invocation), and integrates each response into its evolving context. To facilitate learning, we employ a lightweight rule-based reward comprising format rewards, final outcome rewards, and a novel cost reward for optimizing the balance between performance and cost, opening a pathway toward enhancing performance-cost trade-offs via RL. Router-R1 also conditions only on simple model descriptors such as pricing, latency, and example performance, enabling strong generalization to unseen model selection. Experiments on seven general and multi-hop QA benchmarks show that Router-R1 outperforms several strong baselines, achieving superior performance while maintaining robust generalization and cost management.
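As a rough illustration of how a rule-based reward with format, outcome, and cost components could be combined, consider the minimal sketch below. It is not the paper's implementation: the `<answer>` tag, the exact-match outcome check, the linear cost normalization, the weight `alpha`, and the budget `max_cost_usd` are all assumptions introduced here for illustration.

```python
import re

# Hypothetical output format: the abstract does not specify tag names.
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def format_reward(trajectory: str) -> float:
    """Assumed convention: -1.0 when no final answer block is present, else 0.0."""
    return 0.0 if ANSWER_RE.search(trajectory) else -1.0


def outcome_reward(trajectory: str, gold: str) -> float:
    """Assumed exact-match check (case-insensitive) on the final answer."""
    match = ANSWER_RE.search(trajectory)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip().lower() == gold.strip().lower() else 0.0


def cost_reward(cost_usd: float, max_cost_usd: float) -> float:
    """Assumed linear normalization: 1.0 for zero cost, 0.0 at or above the budget."""
    return 1.0 - min(cost_usd, max_cost_usd) / max_cost_usd


def total_reward(trajectory: str, gold: str, cost_usd: float,
                 max_cost_usd: float = 0.01, alpha: float = 0.5) -> float:
    """Combine the three components; alpha (assumed) trades accuracy against cost."""
    r_format = format_reward(trajectory)
    if r_format < 0:
        # A malformed trajectory receives only the format penalty.
        return r_format
    r_outcome = outcome_reward(trajectory, gold)
    r_cost = cost_reward(cost_usd, max_cost_usd)
    return (1.0 - alpha) * r_outcome + alpha * r_cost
```

Under these assumptions, raising `alpha` pushes the policy toward cheaper routing decisions, while lowering it favors answer correctness regardless of how many or which models are invoked.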