AntBatchInfer: Elastic Batch Inference in the Kubernetes Cluster

Abstract

Offline batch inference is a common task in industry for deep learning applications, but it can be challenging to ensure stability and performance when dealing with large amounts of data and complicated inference pipelines. This paper presents AntBatchInfer, an elastic batch inference framework that is specially optimized for non-dedicated clusters. AntBatchInfer addresses these challenges by providing multi-level fault-tolerant capabilities, enabling the stable execution of versatile and long-running inference tasks. It also improves inference efficiency through pipelining, intra-node scaling, and inter-node scaling. It further optimizes performance in complicated multiple-model batch inference scenarios. Through extensive experiments and real-world statistics, we demonstrate the superiority of our framework in terms of stability and efficiency. In the experiments, it outperforms the baseline by at least $2\times$ in single-model and $6\times$ in multiple-model batch inference. It is also widely used at Ant Group, with thousands of daily jobs from various scenarios, including DLRM, CV, and NLP, which demonstrates its practicality in industry.
