Abstract
The rapid emergence of diverse large language models (LLMs) has spurred the development of LLM routers that assign user queries to the most suitable model. However, existing LLM routers typically perform a single-round, one-to-one mapping (\textit{i.e.}, assigning each query to a single model in isolation), which limits their ability to tackle complex tasks that demand the complementary strengths of multiple LLMs. In this paper, we present \textbf{Router-R1}, a reinforcement learning (RL)-based framework that formulates multi-LLM routing and aggregation as a sequential decision process. Router-R1 instantiates the router itself as a capable LLM, leveraging its reasoning ability to interleave "think" actions (internal deliberation) with "route" actions (dynamic model invocation), and integrates each response into its evolving context. To guide learning, we employ a lightweight rule-based reward comprising format rewards, final outcome rewards, and a novel cost reward, opening a pathway toward optimizing the performance-cost trade-off via RL. Router-R1 also conditions only on simple model descriptors such as pricing, latency, and example performance, enabling strong generalization to the selection of unseen models. Experiments on seven general and multi-hop QA benchmarks show that Router-R1 outperforms several strong baselines, achieving superior performance while maintaining robust generalization and cost management. Code is available at https://github.com/ulab-uiuc/Router-R1.