Abstract
Instruction tuning has been widely adopted to ensure that large language models (LLMs) follow user instructions effectively. The resulting instruction-following capabilities of LLMs rely heavily on the instruction datasets used for tuning. Recently, synthetic instruction datasets have emerged as an economically viable solution for providing LLMs with diverse and high-quality instructions. However, existing approaches typically assume that larger or stronger models are stronger teachers for instruction tuning, and hence simply adopt these models as response generators for the synthetic instructions. In this paper, we challenge this commonly adopted assumption. Our extensive experiments across five base models and twenty response generators reveal that larger and stronger models are not necessarily stronger teachers of smaller models. We refer to this phenomenon as the Larger Models' Paradox. We observe that existing metrics cannot precisely predict the effectiveness of response generators, since they ignore the compatibility between teachers and the base models being fine-tuned. We thus develop a novel metric, named Compatibility-Adjusted Reward (CAR), to measure the effectiveness of response generators. Our experiments across five base models demonstrate that CAR outperforms almost all baselines.