Small Models Struggle to Learn from Strong Reasoners

Abstract

Large language models (LLMs) excel in complex reasoning tasks, and distilling their reasoning capabilities into smaller models has shown promise. However, we uncover an interesting phenomenon, which we term the Small Model Learnability Gap: small models ($\leq$3B parameters) do not consistently benefit from long chain-of-thought (CoT) reasoning or distillation from larger models. Instead, they perform better when fine-tuned on shorter, simpler reasoning chains that better align with their intrinsic learning capacity. To address this, we propose Mix Distillation, a simple yet effective strategy that balances reasoning complexity by combining long and short CoT examples or reasoning from both larger and smaller models. Our experiments demonstrate that Mix Distillation significantly improves small model reasoning performance compared to training on either data alone. These findings highlight the limitations of direct strong model distillation and underscore the importance of adapting reasoning complexity for effective reasoning capability transfer.
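
As a rough illustration of the data-mixing idea behind Mix Distillation, the sketch below builds a fine-tuning set by sampling from a pool of long CoT examples and a pool of short CoT examples. It is a minimal sketch under stated assumptions, not the authors' implementation: the function name `mix_cot_examples`, the `long_fraction` parameter, and the dictionary layout of each example are hypothetical, and the abstract does not specify a mixing ratio.

```python
import random


def mix_cot_examples(long_cot, short_cot, *, long_fraction, total, seed=0):
    """Sample a mixed training set from two pools of CoT examples.

    Hypothetical helper: `long_fraction` is the share of the mixed set drawn
    from `long_cot`; the rest comes from `short_cot`. The value to use is an
    assumption left to the caller, since the abstract states no ratio.
    """
    if not 0.0 <= long_fraction <= 1.0:
        raise ValueError("long_fraction must be in [0, 1]")

    n_long = round(total * long_fraction)
    n_short = total - n_long
    if n_long > len(long_cot) or n_short > len(short_cot):
        raise ValueError("a pool is too small for the requested mix")

    rng = random.Random(seed)  # seeded for reproducible sampling
    mixed = rng.sample(long_cot, n_long) + rng.sample(short_cot, n_short)
    rng.shuffle(mixed)  # interleave long and short examples
    return mixed


# Toy pools; real examples would hold a prompt and a reasoning trace.
long_pool = [{"id": i, "kind": "long"} for i in range(10)]
short_pool = [{"id": i, "kind": "short"} for i in range(10)]

mixed_set = mix_cot_examples(long_pool, short_pool, long_fraction=0.2, total=10)
```

The same helper applies to the second variant named in the abstract (reasoning from larger and smaller teacher models) by passing the two teachers' outputs as the two pools.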
