Explaining Context Length Scaling and Bounds for Language Models

Abstract

Long Context Language Models have drawn great attention in the past few years. Prior work has discussed the impact of long context on Language Model performance: some find that long irrelevant context can harm performance, while others experimentally summarize the loss reduction from relevant long context as Scaling Laws. This calls for a more thorough understanding of how long context impacts Language Modeling. In this work, we (1) propose a clean and effective theoretical framework for explaining the impact of context length on Language Modeling, from an Intrinsic Space perspective; and (2) conduct experiments on natural language and synthetic data, validating our proposed theoretical assumptions and deductions. Our theoretical framework provides practical insights, such as establishing that training dataset size dictates an optimal context length and bounds context length scaling in certain cases. We hope our work may inspire new long context Language Models, as well as future work studying Physics for Language Models. Code for our experiments is available at https://github.com/JingzheShi/NLPCtlScalingAndBounds.
