Optimizing Distributed Deployment of Mixture-of-Experts Model Inference in Serverless Computing

Abstract

With the advancement of serverless computing, running machine learning (ML) inference services over a serverless platform has been advocated, given its labor-free scalability and cost effectiveness. Mixture-of-Experts (MoE) models, with their parallel expert networks, have become a dominant model architecture for enabling large models. Serving large MoE models on serverless computing is potentially beneficial, but has been underexplored because of substantial challenges in handling the skewed expert popularity and the scatter-gather communication bottleneck in MoE model execution, both of which matter for cost-efficient serverless MoE deployment and performance guarantees. We study optimized MoE model deployment and distributed inference serving on a serverless platform, aiming to effectively predict expert selection, pipeline communication with model execution, and minimize the overall billed cost of serving MoE models. Specifically, we propose a Bayesian optimization framework with multi-dimensional epsilon-greedy search to learn expert selections and the MoE deployment that achieves the optimal billed cost, including: 1) a Bayesian decision-making method for predicting expert popularity; 2) flexibly pipelined scatter-gather communication; and 3) an optimal model deployment algorithm for distributed MoE serving. Extensive experiments on AWS Lambda show that our designs reduce the billed cost of all MoE layers by at least 75.67% compared to CPU clusters while maintaining satisfactory inference throughput. Compared to LambdaML in serverless computing, our designs achieve 43.41% lower cost with a throughput decrease of at most 18.76%.
