Abstract
Large Language Models (LLMs) have demonstrated remarkable capabilities, leading to a significant increase in user demand for LLM services. However, cloud-based LLM services often suffer from high latency, unstable responsiveness, and privacy concerns. Therefore, multiple LLMs are usually deployed at the network edge to boost real-time responsiveness and protect data privacy, particularly for many emerging smart mobile and IoT applications. Given the varying response quality and latency of LLM services, a critical issue is how to route user requests from mobile and IoT devices to an appropriate LLM service (i.e., edge LLM expert) to ensure acceptable quality-of-service (QoS). Existing routing algorithms fail to simultaneously address the heterogeneity of LLM services, the interference among requests, and the dynamic workloads necessary for maintaining long-term stable QoS. To meet these challenges, in this paper we propose a novel deep reinforcement learning (DRL)-based QoS-aware LLM routing framework for sustained high-quality LLM services. Due to the dynamic nature of the global state, we propose a dynamic state abstraction technique to compactly represent global state features with a heterogeneous graph attention network (HAN). Additionally, we introduce an action impact estimator and a tailored reward function to guide the DRL agent in maximizing QoS and preventing latency violations. Extensive experiments on both Poisson and real-world workloads demonstrate that our proposed algorithm significantly improves average QoS and computing resource efficiency compared to existing baselines.