Router-R1: Teaching LLMs Multi-Round Routing and Aggregation via Reinforcement Learning

Abstract

The rapid emergence of diverse large language models (LLMs) has spurred the development of LLM routers that assign user queries to the most suitable model. However, existing LLM routers typically perform a single-round, one-to-one mapping (i.e., assigning each query to a single model in isolation), which limits their capability to tackle complex tasks that demand the complementary strengths of multiple LLMs. In this paper, we present Router-R1, a reinforcement learning (RL)-based framework that formulates multi-LLM routing and aggregation as a sequential decision process. Router-R1 instantiates the router itself as a capable LLM, leveraging its reasoning ability to interleave "think" actions (internal deliberation) with "route" actions (dynamic model invocation), and integrates each response into its evolving context. To guide learning, we employ a lightweight rule-based reward comprising format rewards, final outcome rewards, and a novel cost reward, opening a pathway toward optimizing performance-cost trade-offs via RL. Router-R1 also conditions only on simple model descriptors such as pricing, latency, and example performance, enabling strong generalization to unseen model selection. Experiments on seven general and multi-hop QA benchmarks show that Router-R1 outperforms several strong baselines, achieving superior performance while maintaining robust generalization and cost management. Code is available at https://github.com/ulab-uiuc/Router-R1.
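The three-part rule-based reward described above can be illustrated with a minimal Python sketch. The abstract names only the components (format, final outcome, cost), so everything else here is an assumption made for illustration: the `Rollout` container, the exact-match outcome metric, the linear cost normalization by `max_cost`, and the mixing weight `alpha` are hypothetical and should not be read as the paper's actual formulation.

```python
from dataclasses import dataclass


@dataclass
class Rollout:
    """One finished router episode (hypothetical container, not from the paper)."""
    well_formed: bool  # did the output follow the expected think/route/answer format?
    prediction: str    # final answer produced by the router
    gold: str          # reference answer
    cost: float        # total cost of all routed model calls, arbitrary units


def format_reward(r: Rollout) -> float:
    # Assumed convention: penalize malformed outputs, no bonus otherwise.
    return 0.0 if r.well_formed else -1.0


def outcome_reward(r: Rollout) -> float:
    # Assumed metric: case-insensitive exact match on the final answer.
    return 1.0 if r.prediction.strip().lower() == r.gold.strip().lower() else 0.0


def cost_reward(r: Rollout, max_cost: float) -> float:
    # Assumed shape: 1.0 for a free rollout, falling linearly to 0.0 at max_cost.
    if max_cost <= 0:
        raise ValueError("max_cost must be positive")
    return 1.0 - min(r.cost, max_cost) / max_cost


def total_reward(r: Rollout, alpha: float = 0.5, max_cost: float = 10.0) -> float:
    # alpha trades answer quality against cost (assumed linear mix).
    return (
        format_reward(r)
        + (1.0 - alpha) * outcome_reward(r)
        + alpha * cost_reward(r, max_cost)
    )
```

Under these assumptions, raising `alpha` pushes the policy toward cheaper routing decisions, while lowering it favors answer correctness regardless of how many or how expensive the invoked models are.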
