LLM Inference Unveiled: Survey and Roofline Model Insights

Abstract

The field of efficient Large Language Model (LLM) inference is rapidly evolving, presenting a unique blend of opportunities and challenges. Although the field has expanded and is vibrant, there has not been a concise framework that analyzes the various methods of LLM inference to provide a clear understanding of this domain. Our survey stands out from traditional literature reviews by not only summarizing the current state of research but also introducing a framework based on the roofline model for systematic analysis of LLM inference techniques. This framework identifies the bottlenecks when deploying LLMs on hardware devices and provides a clear understanding of practical problems, such as why LLMs are memory-bound, how much memory and computation they need, and how to choose the right hardware. We systematically collate the latest advancements in efficient LLM inference, covering crucial areas such as model compression (e.g., Knowledge Distillation and Quantization), algorithm improvements (e.g., Early Exit and Mixture-of-Experts), and both hardware and system-level enhancements. We analyze these methods with the roofline model, which helps clarify their impact on memory access and computation. This distinctive approach not only showcases the current research landscape but also delivers valuable insights for practical implementation, positioning our work as an indispensable resource for researchers new to the field as well as for those seeking to deepen their understanding of efficient LLM deployment. The analysis tool, LLM-Viewer, is open-sourced.
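To make the roofline reasoning concrete, the following is a minimal sketch and not the LLM-Viewer implementation. It uses the standard roofline bound (attainable throughput is the smaller of peak compute and memory bandwidth times arithmetic intensity). The device figures (100 TFLOP/s, 1 TB/s), the layer width of 4096, and the function names are hypothetical values chosen for illustration; the intensity estimate assumes every operand is moved to or from memory exactly once.

```python
def gemm_intensity(m, n, k, bytes_per_elem):
    """Arithmetic intensity (FLOPs per byte) of an (m x k) @ (k x n) matmul.

    Assumes each input and output element crosses the memory bus once.
    """
    flops = 2 * m * n * k  # one multiply and one add per term
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    return flops / bytes_moved


def attainable_flops(peak_flops, bandwidth, intensity):
    """Roofline bound: min(peak compute, bandwidth * intensity)."""
    return min(peak_flops, bandwidth * intensity)


def is_memory_bound(peak_flops, bandwidth, intensity):
    """True when intensity lies below the ridge point peak_flops / bandwidth."""
    return intensity < peak_flops / bandwidth


# Hypothetical accelerator, for illustration only.
PEAK_FLOPS = 100e12  # FLOP/s
BANDWIDTH = 1e12     # bytes/s

# One linear layer with 16-bit weights (2 bytes per element).
decode = gemm_intensity(m=1, n=4096, k=4096, bytes_per_elem=2)      # one token
prefill = gemm_intensity(m=2048, n=4096, k=4096, bytes_per_elem=2)  # 2048 tokens

decode_bound = is_memory_bound(PEAK_FLOPS, BANDWIDTH, decode)
prefill_bound = is_memory_bound(PEAK_FLOPS, BANDWIDTH, prefill)
```

Under these assumptions, the single-token decode step performs roughly one FLOP per byte of weights read, far below the ridge point of 100 FLOPs per byte, so it is limited by memory bandwidth. The 2048-token prefill reuses each weight many times and reaches the compute roof.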
