Abstract
Fine-tuning large language models (LLMs) for downstream tasks can greatly improve model quality, but serving many different fine-tuned LLMs concurrently for users in multi-tenant environments is challenging. Dedicating GPU memory to each model is prohibitively expensive, and naively swapping large model weights in and out of GPU memory is slow. Our key insight is that fine-tuned models can be quickly swapped in and out of GPU memory by extracting and compressing the delta between each model and its pre-trained base model. We propose DeltaZip, an LLM serving system that efficiently serves multiple full-parameter fine-tuned models concurrently by aggressively compressing model deltas by a factor of $6\times$ to $8\times$ while maintaining high model quality. DeltaZip increases serving throughput by $1.5\times$ to $3\times$ and improves SLO attainment compared to a vanilla HuggingFace serving system.
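
To make the delta idea concrete, the following is a minimal sketch of extracting a delta between fine-tuned and base weights, compressing it, and reconstructing the fine-tuned weights from the shared base. It is illustrative only: the function names are hypothetical, the weights are toy Python lists, and uniform 4-bit quantization is a stand-in chosen for simplicity, not the compression scheme DeltaZip itself uses.

```python
import random


def extract_delta(finetuned, base):
    """Element-wise difference between fine-tuned and base weights."""
    return [f - b for f, b in zip(finetuned, base)]


def quantize(delta, bits=4):
    """Symmetric uniform quantization (an assumed stand-in compressor).

    Returns integer codes in [-levels, levels] and the scale needed to
    map them back to floating-point values.
    """
    levels = 2 ** (bits - 1) - 1  # 7 for 4 bits
    max_abs = max(abs(d) for d in delta)
    scale = max_abs / levels if max_abs > 0 else 1.0
    codes = [max(-levels, min(levels, round(d / scale))) for d in delta]
    return codes, scale


def reconstruct(base, codes, scale):
    """Rebuild approximate fine-tuned weights from base + decoded delta."""
    return [b + c * scale for b, c in zip(base, codes)]


# Toy data: the fine-tuned weights differ from the base by a small perturbation.
random.seed(0)
base = [random.gauss(0.0, 1.0) for _ in range(1024)]
finetuned = [b + random.gauss(0.0, 0.01) for b in base]

delta = extract_delta(finetuned, base)
codes, scale = quantize(delta, bits=4)
restored = reconstruct(base, codes, scale)

# Worst-case reconstruction error is bounded by half a quantization step.
max_err = max(abs(r - f) for r, f in zip(restored, finetuned))
```

In this sketch only `codes` and `scale` need to be moved for each fine-tuned model, while `base` stays resident and is shared across all of them. Because the delta has a much smaller range than the full weights, a coarse code still keeps the reconstruction error small in absolute terms.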