Infini-gram: Scaling Unbounded n-gram Language Models to a Trillion Tokens

Abstract

Are $n$-gram language models still relevant in this era of neural large language models (LLMs)? Our answer is yes, and we showcase their values in both text analysis and improving neural LLMs. This was done by modernizing $n$-gram LMs in two aspects. First, we train them at the same data scale as neural LLMs -- 5 trillion tokens. This is the largest $n$-gram LM ever built. Second, existing $n$-gram LMs use small $n$ which hinders their performance; we instead allow $n$ to be arbitrarily large, by introducing a new $\infty$-gram LM with backoff. Instead of pre-computing $n$-gram count tables (which would be very expensive), we develop an engine named infini-gram -- powered by suffix arrays -- that can compute $\infty$-gram (as well as $n$-gram with arbitrary $n$) probabilities with millisecond-level latency. The $\infty$-gram framework and infini-gram engine enable us to conduct many novel and interesting analyses of human-written and machine-generated text: we find that the $\infty$-gram LM has fairly high accuracy for next-token prediction (47%), and can complement neural LLMs to greatly reduce their perplexity. When analyzing machine-generated text, we also observe irregularities in the machine--$\infty$-gram agreement level with respect to the suffix length, which indicates deficiencies in neural LLM pretraining and the positional embeddings of Transformers.
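
To make the suffix-array idea concrete, the following is a minimal Python sketch, not the infini-gram engine itself. It illustrates how a suffix array lets one count any $n$-gram on demand with two binary searches instead of a pre-computed count table, and how an $\infty$-gram estimate can be formed from those counts. Everything specific here is an assumption made for illustration: the function names (`build_suffix_array`, `count`, `infgram_prob`) and the toy corpus are hypothetical, the construction is a naive sort that would not scale to trillions of tokens, and the backoff rule (use the longest suffix of the context that occurs in the corpus) is one plausible reading of "with backoff".

```python
def build_suffix_array(tokens):
    """Naive suffix array: start positions sorted by the suffix they begin.
    Fine for a toy corpus; a real engine needs a scalable construction."""
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])


def count(tokens, sa, pattern):
    """Count occurrences of `pattern` (a list of tokens) in `tokens`.
    All suffixes that start with `pattern` are contiguous in `sa`,
    so two binary searches find the range."""
    m = len(pattern)
    if m == 0:
        return len(tokens)  # the empty pattern matches at every position

    # Lower bound: first suffix whose length-m prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if tokens[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Upper bound: first suffix whose length-m prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if tokens[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return lo - start


def infgram_prob(tokens, sa, context, token):
    """Assumed backoff rule: drop tokens from the left of `context` until the
    remaining suffix occurs in the corpus, then return
    (count(suffix + token) / count(suffix), len(suffix))."""
    for k in range(len(context) + 1):
        suffix = context[k:]
        denom = count(tokens, sa, suffix)
        if denom > 0:
            return count(tokens, sa, suffix + [token]) / denom, len(suffix)
    return 0.0, 0  # only reached for an empty corpus


# Toy corpus (hypothetical); a real corpus would be token ids, not words.
corpus = "a b c a b d a b c".split()
sa = build_suffix_array(corpus)

print(count(corpus, sa, ["a", "b"]))              # 3
print(infgram_prob(corpus, sa, ["x", "a", "b"], "c"))
```

In the last call the full context `x a b` never occurs in the corpus, so the sketch backs off to `a b`, which occurs 3 times and is followed by `c` in 2 of them. The estimate is therefore 2/3 with an effective context length of 2.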
