Abstract
Causal language models have demonstrated remarkable capabilities, but their size poses significant challenges for deployment in resource-constrained environments. Knowledge distillation, a widely-used technique for transferring knowledge from a large teacher model to a small student model, presents a promising approach for model compression. A significant remaining issue lies in the major differences between teacher and student models, namely the substantial capacity gap, mode averaging, and mode collapse, which pose barriers during distillation. To address these issues, we introduce $\textit{Temporally Adaptive Interpolated Distillation (TAID)}$, a novel knowledge distillation approach that dynamically interpolates student and teacher distributions through an adaptive intermediate distribution, gradually shifting from the student's initial distribution towards the teacher's distribution. We provide a theoretical analysis demonstrating TAID's ability to prevent mode collapse and empirically show its effectiveness in addressing the capacity gap while balancing mode averaging and mode collapse. Our comprehensive experiments demonstrate TAID's superior performance across various model sizes and architectures in both instruction tuning and pre-training scenarios. Furthermore, we showcase TAID's practical impact by developing two state-of-the-art compact foundation models: $\texttt{TAID-LLM-1.5B}$ for language tasks and $\texttt{TAID-VLM-2B}$ for vision-language tasks. These results demonstrate TAID's effectiveness in creating high-performing and efficient models, advancing the development of more accessible AI technologies.
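The interpolation idea described above can be illustrated with a minimal sketch. It is not the paper's implementation: mixing in probability space, the linear schedule, and the helper names (`interpolated_target`, `linear_schedule`, `kl_divergence`) are all assumptions made for illustration, and the adaptive update of the interpolation coefficient is not reproduced here.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    z = logits - np.max(logits, axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def interpolated_target(student_probs, teacher_probs, t):
    """Hypothetical intermediate distribution: t=0 is the student, t=1 the teacher.

    Assumption: a convex mix in probability space. In training, the student
    term would be treated as a constant (no gradient flows through it).
    """
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    return (1.0 - t) * student_probs + t * teacher_probs

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) summed over the last axis."""
    return np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)

def linear_schedule(step, total_steps, t_start=0.0, t_end=1.0):
    """Placeholder schedule (assumption): a fixed linear ramp, not an adaptive rule."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return t_start + frac * (t_end - t_start)

if __name__ == "__main__":
    student = softmax(np.array([2.0, 0.5, -1.0]))   # toy student distribution
    teacher = softmax(np.array([0.1, 1.5, 1.4]))    # toy teacher distribution
    for step in (0, 50, 100):
        t = linear_schedule(step, 100)
        target = interpolated_target(student, teacher, t)
        # Distance from the student to its current target grows as t increases.
        print(step, t, kl_divergence(target, student))
```

At `t = 0` the target coincides with the student, so the distillation signal starts near zero and the gap the student must close grows only gradually as `t` approaches 1. This is one way to picture how an intermediate target could ease the capacity gap.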