Merge-of-Thought Distillation

Abstract

Efficient reasoning distillation for long chain-of-thought (CoT) models is increasingly constrained by the assumption of a single oracle teacher, despite the practical availability of multiple candidate teachers and growing CoT corpora. We revisit teacher selection and observe that different students have different "best teachers," and that even for the same student the best teacher can vary across datasets. Therefore, to unify multiple teachers' reasoning abilities into a single student while overcoming conflicts among the teachers' supervision, we propose Merge-of-Thought Distillation (MoT), a lightweight framework that alternates between teacher-specific supervised fine-tuning branches and weight-space merging of the resulting student variants. On competition math benchmarks, using only about 200 high-quality CoT samples, applying MoT to a Qwen3-14B student surpasses strong models including DEEPSEEK-R1, QWEN3-30B-A3B, QWEN3-32B, and OPENAI-O1. In addition, MoT consistently outperforms both the best single-teacher distillation and the naive multi-teacher union, raises the performance ceiling while mitigating overfitting, and is robust to distribution-shifted and peer-level teachers. Moreover, MoT reduces catastrophic forgetting, improves general reasoning beyond mathematics, and even cultivates a better teacher, indicating that consensus-filtered reasoning features transfer broadly. These results position MoT as a simple, scalable route to efficiently distilling long CoT capabilities from diverse teachers into compact students.
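
The alternation described in the abstract (one fine-tuning branch per teacher, followed by a weight-space merge) can be illustrated with a minimal toy sketch. The abstract does not specify the merge rule, the number of rounds, or the training setup, so everything here is an assumption for illustration only: uniform parameter averaging stands in for the merge, weights are plain dictionaries of floats, and `toy_fine_tune`, `average_weights`, and `mot_rounds` are hypothetical names, not the authors' code.

```python
from typing import Callable, Dict, List

Weights = Dict[str, float]


def average_weights(variants: List[Weights]) -> Weights:
    """Merge student variants by uniform parameter averaging.

    Assumption: the abstract only says "weight-space merging";
    uniform averaging is one simple instance, chosen for illustration.
    """
    keys = variants[0].keys()
    return {k: sum(v[k] for v in variants) / len(variants) for k in keys}


def toy_fine_tune(weights: Weights, teacher_target: Weights, lr: float = 0.5) -> Weights:
    """Hypothetical stand-in for teacher-specific supervised fine-tuning.

    Moves each parameter a fraction `lr` of the way toward a per-teacher
    target. A real branch would run SFT on that teacher's CoT samples.
    """
    return {k: w + lr * (teacher_target[k] - w) for k, w in weights.items()}


def mot_rounds(
    student: Weights,
    teachers: List[Weights],
    fine_tune: Callable[[Weights, Weights], Weights],
    num_rounds: int,
) -> Weights:
    """Alternate between per-teacher branches and merging."""
    for _ in range(num_rounds):
        # Each branch starts from a copy of the current merged student.
        branches = [fine_tune(dict(student), teacher) for teacher in teachers]
        # The merged result becomes the starting point of the next round.
        student = average_weights(branches)
    return student


if __name__ == "__main__":
    start = {"w": 0.0}
    teacher_targets = [{"w": 1.0}, {"w": 3.0}]
    print(mot_rounds(start, teacher_targets, toy_fine_tune, num_rounds=2))
```

In this toy setting the merged student drifts toward a point between the two teacher targets rather than committing to either one, which is the intuition behind merging branches instead of training on a single teacher.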
