Abstract
The high memory and computation demands of large language models (LLMs) make them challenging to deploy on consumer devices due to limited GPU memory. Offloading can mitigate the memory constraint but often suffers from low GPU utilization, leading to low inference efficiency. In this work, we propose a novel framework, called pipelined offloading (PIPO), for efficient inference on consumer devices. PIPO designs a fine-grained offloading pipeline, complemented with optimized data transfer and computation, to achieve high concurrency and efficient scheduling for inference. Experimental results show that, compared with a state-of-the-art baseline, PIPO increases GPU utilization from below 40% to over 90% and achieves up to 3.1$\times$ higher throughput, running on a laptop equipped with an RTX3060 GPU with 6GB of memory.