XGrammar: Flexible and Efficient Structured Generation Engine for Large Language Models

  • 2024-11-27 18:59:28
  • Yixin Dong, Charlie F. Ruan, Yaxing Cai, Ruihang Lai, Ziyi Xu, Yilong Zhao, Tianqi Chen

Abstract

The applications of LLM agents are becoming increasingly complex and diverse, leading to a high demand for structured outputs that can be parsed into code, structured function calls, and embodied agent commands. These developments place significant demands on structured generation in LLM inference. Context-free grammar is a flexible approach to enabling structured generation via constrained decoding. However, executing a context-free grammar requires traversing several stack states for every token in the vocabulary at runtime, which adds non-negligible overhead to structured generation. In this paper, we propose XGrammar, a flexible and efficient structured generation engine for large language models. XGrammar accelerates context-free grammar execution by dividing the vocabulary into context-independent tokens that can be prechecked and context-dependent tokens that need to be interpreted at runtime. We further build transformations to expand the grammar context and reduce the number of context-dependent tokens. Additionally, we build an efficient persistent stack to accelerate the context-dependent token checks. Finally, we co-design the grammar engine with the LLM inference engine to overlap grammar computation with GPU execution. Evaluation results show that XGrammar can achieve up to 100x speedup over existing solutions. Combined with an LLM inference engine, it enables near-zero-overhead structured generation in end-to-end LLM serving.
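
The split between context-independent and context-dependent tokens can be illustrated with a toy example. The sketch below is not XGrammar's implementation or API. It assumes a hypothetical grammar of nested square brackets, a made-up eight-token vocabulary, and an integer nesting depth standing in for the parser stack. All names (`VOCAB`, `min_balance`, `precompute`, `token_mask`) are invented for illustration. Tokens whose validity never depends on the stack are classified once ahead of time, and only the remaining tokens are checked against the stack at each decoding step.

```python
from typing import Dict, List, Optional, Set, Tuple

# Hypothetical toy vocabulary; a real LLM vocabulary has many thousands of tokens.
VOCAB: List[str] = ["[", "]", "[]", "]]", "[[", "][", "a", "[a"]


def min_balance(token: str) -> Optional[int]:
    """Lowest running bracket balance reached while scanning the token.

    Returns None if the token contains a character outside the toy grammar.
    """
    balance = 0
    lowest = 0
    for ch in token:
        if ch == "[":
            balance += 1
        elif ch == "]":
            balance -= 1
            lowest = min(lowest, balance)
        else:
            return None
    return lowest


def precompute(vocab: List[str]) -> Tuple[Set[int], Set[int], Dict[int, int]]:
    """Classify every token once, before decoding starts."""
    accepted: Set[int] = set()      # context-independent, always valid
    rejected: Set[int] = set()      # context-independent, never valid
    dependent: Dict[int, int] = {}  # context-dependent: token id -> required depth
    for token_id, token in enumerate(vocab):
        lowest = min_balance(token)
        if lowest is None:
            rejected.add(token_id)
        elif lowest == 0:
            # Never closes a bracket it did not open, so the stack is irrelevant.
            accepted.add(token_id)
        else:
            # Closes -lowest brackets that must already be open on the stack.
            dependent[token_id] = -lowest
    return accepted, rejected, dependent


def token_mask(depth: int, accepted: Set[int], dependent: Dict[int, int],
               vocab_size: int) -> List[bool]:
    """Build the per-step mask; only context-dependent tokens consult the stack."""
    mask = [False] * vocab_size
    for token_id in accepted:
        mask[token_id] = True
    for token_id, needed in dependent.items():
        if depth >= needed:
            mask[token_id] = True
    return mask


ACCEPTED, REJECTED, DEPENDENT = precompute(VOCAB)
for depth in range(3):
    allowed = [VOCAB[i] for i, ok in
               enumerate(token_mask(depth, ACCEPTED, DEPENDENT, len(VOCAB))) if ok]
    print(depth, allowed)
```

In this toy setting, five of the eight tokens are resolved during precomputation, so each decoding step inspects only the three context-dependent ones. A real engine would keep such a cache per grammar position and use an actual stack instead of a single depth counter.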

 
