Abstract
We prove a new asymptotic equipartition property for the perplexity of long texts generated by a language model and present supporting experimental evidence from open-source models. Specifically we show that the logarithmic perplexity of any large text generated by a language model must asymptotically converge to the average entropy of its token distributions. This defines a "typical set" that all long synthetic texts generated by a language model must belong to. We show that this typical set is a vanishingly small subset of all possible grammatically correct outputs. These results suggest possible applications to important practical problems such as (a) detecting synthetic AI-generated text, and (b) testing whether a text was used to train a language model. We make no simplifying assumptions (such as stationarity) about the statistics of language model outputs, and therefore our results are directly applicable to practical real-world models without any approximations.
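As an informal illustration of the convergence claim (it is not the experiment reported in this paper), the following Python sketch samples text from a toy next-token model and compares the log-perplexity of the sampled text with the average entropy of the distributions it was drawn from. Everything here is an assumption made for illustration: the toy model, the vocabulary size, and the names `VOCAB_SIZE`, `next_token_probs`, and `log_perplexity_and_mean_entropy` are hypothetical. The toy distribution depends on both the previous token and the position, so it is deliberately not stationary.

```python
import math
import random
from functools import lru_cache

VOCAB_SIZE = 8  # hypothetical toy vocabulary, chosen only for illustration


@lru_cache(maxsize=None)
def next_token_probs(prev_token, phase):
    """Toy stand-in for a language model's next-token distribution.

    The distribution depends on the previous token and on the position
    (through `phase`), so the generating process is not stationary.
    """
    rng = random.Random(prev_token * 31 + phase)
    weights = [rng.random() + 0.05 for _ in range(VOCAB_SIZE)]
    total = sum(weights)
    return tuple(w / total for w in weights)


def log_perplexity_and_mean_entropy(num_tokens, seed=0):
    """Sample `num_tokens` tokens; return (log-perplexity, average entropy) in nats."""
    sampler = random.Random(seed)
    prev_token = 0
    neg_log_prob_sum = 0.0
    entropy_sum = 0.0
    for position in range(num_tokens):
        probs = next_token_probs(prev_token, position % 7)
        token = sampler.choices(range(VOCAB_SIZE), weights=probs)[0]
        # Log-perplexity accumulates -log p(sampled token | context).
        neg_log_prob_sum -= math.log(probs[token])
        # Entropy of the distribution the token was sampled from.
        entropy_sum -= sum(p * math.log(p) for p in probs)
        prev_token = token
    return neg_log_prob_sum / num_tokens, entropy_sum / num_tokens


if __name__ == "__main__":
    for n in (100, 1_000, 10_000, 100_000):
        log_ppl, mean_entropy = log_perplexity_and_mean_entropy(n)
        gap = abs(log_ppl - mean_entropy)
        print(f"N={n:>6}  log-perplexity={log_ppl:.4f}  "
              f"mean entropy={mean_entropy:.4f}  gap={gap:.4f}")
```

In this sketch the gap between the two quantities should tend to shrink as the text grows longer, which is the behaviour the equipartition property describes. A real language model would replace `next_token_probs` with its own conditional distributions.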