Abstract
Knowledge distillation can be a cost-effective technique to distill knowledge in Large Language Models, if the teacher output logits can be pre-computed and cached. However, successfully applying this to pre-training remains largely unexplored. In this work, we prove that naive approaches for sparse knowledge distillation such as caching Top-K probabilities, while intuitive, provide biased estimates of the teacher probability distribution to the student, resulting in suboptimal performance and calibration. We propose an importance-sampling-based method `Random Sampling Knowledge Distillation', which provides unbiased estimates, preserves the gradient in expectation, and requires storing significantly sparser logits. Our method enables faster training of student models with marginal overhead (<10%) compared to cross-entropy-based training, while maintaining competitive performance compared to full distillation, across a range of model sizes from 300M to 3B.
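As a toy illustration of the bias claim, the Python sketch below compares a renormalized Top-K target with a sampled target on a small synthetic distribution. Everything in it is an assumption made for illustration and is not taken from the paper: the Zipf-like teacher distribution, the vocabulary size, the helper names `topk_estimate` and `sampled_estimate`, and the choice to sample directly from the teacher distribution, so that the importance weights are uniform and the estimate reduces to count frequencies.

```python
import numpy as np

def topk_estimate(p, k):
    """Keep the k largest teacher probabilities and renormalize them."""
    idx = np.argsort(p)[-k:]
    est = np.zeros_like(p)
    est[idx] = p[idx] / p[idx].sum()
    return est

def sampled_estimate(p, n, rng):
    """Draw n token ids from p with replacement; return count frequencies.

    Assumption: the proposal equals the teacher distribution itself.
    """
    return rng.multinomial(n, p) / n

# Hypothetical toy teacher distribution over a 50-token vocabulary (Zipf-like).
vocab = 50
p = 1.0 / np.arange(1, vocab + 1)
p /= p.sum()

k = 8  # number of entries stored per token position in both schemes
rng = np.random.default_rng(0)

# Top-K is deterministic, so its expectation is the estimate itself.
topk = topk_estimate(p, k)

# Approximate the expectation of the sampled estimate by averaging many draws.
trials = 20000
avg_sampled = rng.multinomial(k, p, size=trials).mean(axis=0) / k

topk_bias = np.abs(topk - p).sum()
sampled_bias = np.abs(avg_sampled - p).sum()
print(f"L1 gap to teacher, Top-K:   {topk_bias:.3f}")
print(f"L1 gap to teacher, sampled: {sampled_bias:.3f}")
```

In this setup the Top-K gap equals twice the discarded tail mass and does not shrink with more trials, whereas the averaged sampled estimate approaches the teacher distribution as the number of trials grows. Any single sampled estimate is still sparse, with at most `k` nonzero entries.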