Abstract
Language model (LM) distillation aims to distill the knowledge in a large teacher LM into a small student one. A critical issue facing LM distillation is that a superior student often arises from a teacher of a relatively small scale rather than a larger one, especially in the presence of a substantial capacity gap between the teacher and student. This issue, often referred to as the \textit{curse of capacity gap}, suggests that there is likely an optimal teacher yielding the best-performing student along the scaling course of the teacher. Consequently, distillation trials on teachers of a wide range of scales are needed to determine the optimal teacher, which becomes computationally intensive in the context of large LMs (LLMs). This paper addresses this bottleneck by providing the \textit{law of capacity gap}, derived from a preliminary study on distilling a broad range of small-scale (<3B) LMs, in which the scale of the optimal teacher consistently grows linearly with the student scale across different model and data scales. By extending the law to LLM distillation on a larger scale (7B), we succeed in obtaining versatile LLMs that outperform a wide array of competitors.