Farseer: A Refined Scaling Law in Large Language Models

Abstract

Training Large Language Models (LLMs) is prohibitively expensive, creating a critical scaling gap where insights from small-scale experiments often fail to transfer to resource-intensive production systems, thereby hindering efficient innovation. To bridge this, we introduce Farseer, a novel and refined scaling law offering enhanced predictive accuracy across scales. By systematically constructing a model loss surface $L(N,D)$, Farseer achieves a significantly better fit to empirical data than prior laws (e.g., Chinchilla's law). Our methodology yields accurate, robust, and highly generalizable predictions, demonstrating excellent extrapolation capabilities, improving upon Chinchilla's law by reducing extrapolation error by 433\%. This allows for the reliable evaluation of competing training strategies across all $(N,D)$ settings, enabling conclusions from small-scale ablation studies to be confidently extrapolated to predict large-scale performance. Furthermore, Farseer provides new insights into optimal compute allocation, better reflecting the nuanced demands of modern LLM training. To validate our approach, we trained an extensive suite of approximately 1,000 LLMs across diverse scales and configurations, consuming roughly 3 million NVIDIA H100 GPU hours. We are comprehensively open-sourcing all models, data, results, and logs at https://github.com/Farseer-Scaling-Law/Farseer to foster further research.
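For readers unfamiliar with the baseline, the following is a minimal sketch of the Chinchilla parametric loss that the abstract compares against, where $N$ is the parameter count and $D$ the number of training tokens. The functional form and coefficients are assumptions taken from the parametric fit reported by Hoffmann et al. (2022), not from this abstract; Farseer's own functional form is not stated here and is not reproduced. The helper names `chinchilla_loss` and `relative_error` are hypothetical and chosen for illustration only.

```python
# Baseline only: the Chinchilla parametric form L(N, D) = E + A / N**alpha + B / D**beta.
# Coefficients are the fit reported by Hoffmann et al. (2022); they are an
# assumption brought in for illustration and are NOT Farseer's fitted values.
E, A, B, ALPHA, BETA = 1.69, 406.4, 410.7, 0.34, 0.28


def chinchilla_loss(n_params: float, n_tokens: float) -> float:
    """Predicted loss for a model with n_params parameters trained on n_tokens tokens."""
    return E + A / n_params**ALPHA + B / n_tokens**BETA


def relative_error(predicted: float, observed: float) -> float:
    """Relative prediction error, one common way to score extrapolation quality."""
    return abs(predicted - observed) / observed


if __name__ == "__main__":
    # Evaluate the baseline surface at a few (N, D) points.
    for n, d in [(1e9, 2e10), (7e9, 1.4e11), (70e9, 1.4e12)]:
        print(f"N={n:.1e}  D={d:.1e}  predicted loss={chinchilla_loss(n, d):.4f}")
```

Scoring a scaling law by extrapolation then amounts to fitting it on small $(N,D)$ runs and applying `relative_error` to measured losses from larger, held-out runs.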
