Abstract
Large Language Models (LLMs) have demonstrated remarkable capabilities across numerous tasks, yet principled explanations for their underlying mechanisms and for several phenomena, such as scaling laws, hallucinations, and related behaviors, remain elusive. In this work, we revisit the classical relationship between compression and prediction, grounded in Kolmogorov complexity and Shannon information theory, to provide deeper insights into LLM behaviors. By leveraging the Kolmogorov Structure Function and interpreting LLM compression as a two-part coding process, we offer a detailed view of how LLMs acquire and store information across increasing model and data scales -- from pervasive syntactic patterns to progressively rarer knowledge elements. Motivated by this theoretical perspective and natural assumptions inspired by Heaps' and Zipf's laws, we introduce a simplified yet representative hierarchical data-generation framework called the Syntax-Knowledge model. Under the Bayesian setting, we show that prediction and compression within this model naturally lead to diverse learning and scaling behaviors of LLMs. In particular, our theoretical analysis offers intuitive and principled explanations for both data and model scaling laws, the dynamics of knowledge acquisition during training and fine-tuning, and factual knowledge hallucinations in LLMs. The experimental results validate our theoretical predictions.