Abstract
Next-token prediction (NTP) is the cornerstone of modern large language model (LLM) pretraining, driving unprecedented capabilities in text generation, reasoning, and instruction following. However, token-level prediction limits the model's capacity to capture higher-level semantic structures and long-range contextual relationships. To overcome this limitation, we introduce \textbf{ContextLM}, a framework that augments standard pretraining with an inherent \textbf{next-context prediction} objective. This mechanism trains the model to learn predictive representations of multi-token contexts, leveraging error signals derived from future token chunks. Crucially, ContextLM achieves this enhancement while remaining fully compatible with the standard autoregressive, token-by-token evaluation paradigm (e.g., perplexity). Extensive experiments on the GPT2 and Pythia model families, scaled up to $1.5$B parameters, show that ContextLM delivers consistent improvements in both perplexity and downstream task performance. Our analysis indicates that next-context prediction provides a scalable and efficient pathway to stronger language modeling, yielding better long-range coherence and more effective attention allocation with minimal computational overhead.