Abstract
Auto-regressive language models (LMs) have been widely used to generate data in data-scarce domains to train new LMs, compensating for the scarcity of real-world data. Previous work experimentally found that LMs collapse when trained on recursively generated data. This paper presents a theoretical proof: once a corpus (such as a subset of the World Wide Web) begins to incorporate generated data and no new real-world data is added to the corpus, then no matter how small the amount of data each LM generates and contributes to the corpus, LM collapse is inevitable after sufficient time. This finding suggests that attempts to mitigate collapse by limiting the quantity of synthetic data in the corpus are fundamentally insufficient. Instead, avoiding collapse hinges on ensuring the quality of synthetic data.