Doing More with Less: A Survey on Routing Strategies for Resource Optimisation in Large Language Model-Based Systems

Abstract

Large Language Model (LLM)-based systems, i.e. interconnected elements that include an LLM as a central component, such as conversational agents, are usually designed with monolithic, static architectures that rely on a single, general-purpose LLM to handle all user queries. However, these systems may be inefficient, as different queries may require different levels of reasoning, domain knowledge or pre-processing. While generalist LLMs (e.g. GPT-4o, Claude-Sonnet) perform well across a wide range of tasks, they may incur significant financial, energy and computational costs. These costs may be disproportionate for simpler queries, resulting in unnecessary resource utilisation. A routing mechanism can therefore be employed to route queries to more appropriate components, such as smaller or specialised models, thereby improving efficiency and optimising resource consumption. This survey aims to provide a comprehensive overview of routing strategies in LLM-based systems. Specifically, it reviews when, why, and how routing should be integrated into LLM pipelines to improve efficiency, scalability, and performance. We define the objectives to optimise, such as cost minimisation and performance maximisation, and discuss the timing of routing within the LLM workflow, whether it occurs before or after generation. We also detail the various implementation strategies, including similarity-based, supervised, reinforcement learning-based, and generative methods. Practical considerations such as industrial applications are also examined, as are current limitations, like standardising routing experiments, accounting for non-financial costs, and designing adaptive strategies. By formalising routing as a performance-cost optimisation problem, this survey provides tools and directions to guide future research and development of adaptive low-cost LLM-based systems.
