Relative Scaling Laws for LLMs

Abstract

Scaling laws describe how language models improve with additional data, parameters, and compute. While widely used, they are typically measured on aggregate test sets. Aggregate evaluations yield clean trends but average over heterogeneous subpopulations, obscuring performance disparities. We introduce relative scaling laws, which track how performance gaps between test distributions evolve with scale rather than focusing solely on absolute error. Using 255 decoder-only Transformers trained under matched-compute (IsoFLOP) budgets from $10^{18}$--$10^{20}$ FLOPs on standard pretraining datasets, we find diverse trajectories: academic domains on MMLU converge toward parity; regional English dialects shift depending on population size; and clusters of AI risk behaviours split, with capability- and influence-related risks increasing during pretraining while adversarial risks do not. These results show that although scaling improves overall performance, it is not a universal equalizer. To support further study, we release all model checkpoints from this work to enable practitioners to measure relative alongside traditional scaling laws, in order to better prioritize robustness challenges in light of the bitter lesson.
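As a rough illustration of what tracking a performance gap across scale could look like, the sketch below fits a power law to the ratio of two test losses as a function of training compute. The abstract does not state the functional form used in the paper, so the power-law-in-compute assumption, the function names, and the synthetic loss values are all hypothetical and serve only to make the idea concrete.

```python
import math

def fit_log_log_slope(compute, values):
    """Least-squares slope and intercept of log(values) against log(compute)."""
    xs = [math.log(c) for c in compute]
    ys = [math.log(v) for v in values]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def relative_scaling_exponent(compute, loss_subgroup, loss_reference):
    """Exponent of the loss ratio (subgroup / reference) as a power law in compute.

    Assumption (not taken from the paper): the ratio behaves like
    ratio(C) = k * C**exponent over the fitted range.
    """
    ratios = [s / r for s, r in zip(loss_subgroup, loss_reference)]
    slope, _ = fit_log_log_slope(compute, ratios)
    return slope

# Synthetic, made-up losses at three compute budgets (FLOPs).
compute = [1e18, 1e19, 1e20]
loss_reference = [3.0 * (c / 1e18) ** -0.05 for c in compute]
loss_subgroup = [3.6 * (c / 1e18) ** -0.08 for c in compute]

exponent = relative_scaling_exponent(compute, loss_subgroup, loss_reference)
print(f"relative exponent: {exponent:.3f}")
```

Under this assumed form, a negative exponent means the subgroup's loss shrinks relative to the reference as compute grows, a positive exponent means the gap widens, and a value near zero means the two distributions improve at about the same rate.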
