Mentor-KD: Making Small Language Models Better Multi-step Reasoners

Abstract

Large Language Models (LLMs) have displayed remarkable performance across various complex tasks by leveraging Chain-of-Thought (CoT) prompting. Recent studies have proposed reasoning distillation, a Knowledge Distillation (KD) approach that transfers this reasoning ability by fine-tuning language models on multi-step rationales generated by LLM teachers. However, these studies have not adequately addressed two challenges arising from the insufficient distillation sets provided by the LLM teacher model: 1) data quality and 2) soft label provision. In this paper, we propose Mentor-KD, which effectively distills the multi-step reasoning capability of LLMs to smaller LMs while addressing these challenges. Specifically, we exploit a mentor, an intermediate-sized task-specific fine-tuned model, to augment additional CoT annotations and provide soft labels for the student model during reasoning distillation. We conduct extensive experiments and confirm Mentor-KD's effectiveness across various models and complex reasoning tasks.
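
The role of the mentor's soft labels can be illustrated with a minimal sketch. The abstract does not specify the training objective, so the code below assumes the standard temperature-scaled distillation loss (KL divergence between softened mentor and student distributions) interpolated with a rationale fine-tuning loss. The function names, the weight `lam`, and the temperature value are hypothetical and chosen only for illustration.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax over temperature-scaled logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in scaled))
    return [x - log_sum_exp for x in scaled]


def soft_label_loss(mentor_logits, student_logits, temperature=2.0):
    """KL(mentor || student) between temperature-softened distributions.

    Assumption: the usual soft-label distillation term; the abstract
    does not state the exact form used by Mentor-KD.
    """
    log_p = log_softmax(mentor_logits, temperature)   # mentor (soft labels)
    log_q = log_softmax(student_logits, temperature)  # student predictions
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def combined_loss(rationale_nll, mentor_logits, student_logits,
                  lam=0.5, temperature=2.0):
    """Interpolate the rationale fine-tuning loss with the soft-label term.

    `rationale_nll` stands in for the student's negative log-likelihood on
    CoT rationales; `lam` is a hypothetical mixing weight.
    """
    kd = soft_label_loss(mentor_logits, student_logits, temperature)
    return (1.0 - lam) * rationale_nll + lam * kd


if __name__ == "__main__":
    mentor = [2.0, 0.5, -1.0]   # toy logits over three answer options
    student = [1.0, 1.0, 0.0]
    print(soft_label_loss(mentor, student))
    print(combined_loss(1.2, mentor, student, lam=0.3))
```

Under this assumed objective, the student is pulled toward the mentor's full output distribution rather than only the single correct label, which is the kind of signal a black-box LLM teacher that returns text alone cannot supply.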
