A Survey of Serverless Machine Learning Model Inference

Abstract

Recent developments in Generative AI, Computer Vision, and Natural Language Processing have led to an increased integration of AI models into various products. This widespread adoption of AI requires significant effort in deploying these models in production environments. When hosting machine learning models for real-time predictions, it is important to meet defined Service Level Objectives (SLOs) while ensuring reliability, minimizing downtime, and optimizing the operational costs of the underlying infrastructure. Large machine learning models often demand GPU resources for efficient inference to meet SLOs. In the context of these trends, there is growing interest in hosting AI models in a serverless architecture while still providing GPU access for inference tasks. This survey aims to summarize and categorize the emerging challenges and optimization opportunities for large-scale deep learning serving systems. By providing a novel taxonomy and summarizing recent trends, we hope that this survey sheds light on new optimization perspectives and motivates novel work in large-scale deep learning serving systems.
