Abstract
The surge in generative AI workloads has created a need for scalable inference systems that can flexibly harness both GPUs and specialized accelerators while containing operational costs. This paper proposes a hardware-agnostic control loop that adaptively allocates requests across heterogeneous accelerators based on real-time cost and capacity signals. The approach sustains low latency and high throughput by dynamically shifting between cost-optimized and capacity-optimized modes, ensuring the most efficient use of expensive compute resources under fluctuating availability. Evaluated using the Stable Diffusion model, the framework consistently meets latency targets, automatically redirects traffic during capacity shortfalls, and capitalizes on lower-cost accelerators when possible. These results highlight how a feedback-driven deployment strategy, spanning the entire software and hardware stack, can help organizations efficiently scale generative AI workloads while maintaining resilience in the face of limited accelerator capacity.
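As an illustration only, one step of such a control loop might be sketched as follows. This is a minimal sketch under stated assumptions, not the paper's implementation: the `Pool` type, the `cost_per_request` and `free_slots` signals, the rule for switching modes, and the cheapest-first allocation are all hypothetical choices made for the example.

```python
from dataclasses import dataclass


@dataclass
class Pool:
    """Hypothetical accelerator pool; the fields are assumed signals."""
    name: str
    cost_per_request: float  # assumed real-time cost signal
    free_slots: int          # assumed real-time capacity signal


def choose_mode(pools, demand):
    """Return 'cost' if the cheapest pool alone can absorb the demand,
    otherwise 'capacity' (an assumed switching rule)."""
    cheapest = min(pools, key=lambda p: p.cost_per_request)
    return "cost" if cheapest.free_slots >= demand else "capacity"


def allocate(pools, demand):
    """Fill pools cheapest-first and spill over to costlier pools on a
    shortfall. Returns (mode, plan, unserved_requests)."""
    mode = choose_mode(pools, demand)
    plan = {p.name: 0 for p in pools}
    remaining = demand
    for pool in sorted(pools, key=lambda p: p.cost_per_request):
        take = min(pool.free_slots, remaining)
        plan[pool.name] = take
        remaining -= take
        if remaining == 0:
            break
    return mode, plan, remaining


if __name__ == "__main__":
    # Illustrative numbers only; they are not measurements from the paper.
    pools = [Pool("gpu", 1.0, 10), Pool("accel", 0.6, 4)]
    print(allocate(pools, 3))  # low demand: served by the cheaper pool
    print(allocate(pools, 8))  # shortfall: traffic spills over to the GPU pool
```

In a running system this step would be re-evaluated whenever the cost or capacity signals change, so that traffic returns to the lower-cost pool once its capacity recovers.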