ServerlessLLM: Locality-Enhanced Serverless Inference for Large Language Models

Abstract

This paper presents ServerlessLLM, a locality-enhanced serverless inference system for Large Language Models (LLMs). ServerlessLLM exploits the substantial capacity and bandwidth of storage and memory devices available on GPU servers, thereby reducing costly remote checkpoint downloads and achieving efficient checkpoint loading. ServerlessLLM achieves this through three main contributions: (i) fast LLM checkpoint loading via a novel loading-optimized checkpoint format design, coupled with an efficient multi-tier checkpoint loading system; (ii) locality-driven LLM inference with live migration, which allows ServerlessLLM to effectively achieve locality-driven server allocation while preserving the low latency of ongoing LLM inference; and (iii) locality-aware server allocation, enabling ServerlessLLM to evaluate the status of each server in a cluster and effectively schedule model startup time to capitalize on local checkpoint placement. Our comprehensive experiments, which include microbenchmarks and real-world traces, show that ServerlessLLM surpasses state-of-the-art systems by 10-200X in latency performance when running various LLM inference workloads.
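
To make the idea behind contribution (iii) concrete, the following is a minimal sketch of locality-aware server selection, not the scheduler described in the paper. It assumes a simplified cost model in which startup time is the queueing delay plus checkpoint size divided by the bandwidth of the fastest tier holding the checkpoint. The names `Server`, `estimated_startup_s`, `pick_server`, and all bandwidth figures are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass

# Hypothetical per-tier load bandwidths in GB/s (illustrative values only,
# not measurements from the paper).
TIER_BANDWIDTH_GBPS = {"dram": 20.0, "ssd": 5.0, "remote": 1.0}


@dataclass
class Server:
    name: str
    checkpoint_tier: str        # fastest tier where the checkpoint resides
    queue_delay_s: float = 0.0  # assumed wait until a GPU becomes free


def estimated_startup_s(server: Server, checkpoint_gb: float) -> float:
    """Assumed cost model: queueing delay plus time to load the checkpoint."""
    bandwidth = TIER_BANDWIDTH_GBPS[server.checkpoint_tier]
    return server.queue_delay_s + checkpoint_gb / bandwidth


def pick_server(servers: list[Server], checkpoint_gb: float) -> Server:
    """Choose the server with the lowest estimated startup time."""
    return min(servers, key=lambda s: estimated_startup_s(s, checkpoint_gb))


servers = [
    Server("a", "remote"),                   # must download the checkpoint
    Server("b", "ssd", queue_delay_s=1.0),   # checkpoint on local SSD
    Server("c", "dram", queue_delay_s=10.0), # checkpoint in memory, but busy
]
best = pick_server(servers, checkpoint_gb=20.0)
print(best.name)
```

Under these assumed numbers, the server with the checkpoint on local SSD is preferred over both the idle server that would need a remote download and the busy server that already holds the checkpoint in memory, which illustrates why a scheduler would weigh checkpoint placement against server status rather than looking at either alone.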
