On the Generalization vs Fidelity Paradox in Knowledge Distillation

Abstract

Knowledge distillation (KD) is a key technique for compressing large language models into smaller ones while preserving performance. Despite the recent traction of KD research, its effectiveness for smaller language models (LMs) and the mechanisms driving knowledge transfer remain underexplored. In this work, we present the first large-scale empirical and statistical analysis of KD across models ranging from 0.5B to 7B parameters on 14 complex reasoning tasks in a zero-shot setting. Our findings reveal that KD can improve the average performance of smaller models by up to $10\%$, with a peak task-specific gain of $22\%$, while providing only marginal benefits ($\sim 1.3\%$) for larger models. Surprisingly, teacher performance has a minimal impact on student outcomes, while teacher task expertise impacts KD effectiveness. A correlation study indicates that smaller LMs benefit more from KD, whereas larger LMs show diminished gains. Additionally, we uncover a misalignment between improvements in student performance and reasoning fidelity, suggesting that while KD enhances accuracy, it does not always maintain the structured decision-making processes of the teacher. Our ablation study further highlights the importance of teacher signals and logit smoothing in influencing students' performance after distillation. Overall, our study offers a comprehensive empirical and statistical assessment of KD, highlighting both its benefits and trade-offs when distilling knowledge from larger to smaller LMs.
