Self-Training with Direct Preference Optimization Improves Chain-of-Thought Reasoning

Abstract

Effective training of language models (LMs) for mathematical reasoning tasks demands high-quality supervised fine-tuning data. Besides obtaining annotations from human experts, a common alternative is sampling from larger and more powerful LMs. However, this knowledge distillation approach can be costly and unstable, particularly when relying on closed-source, proprietary LMs like GPT-4, whose behaviors are often unpredictable. In this work, we demonstrate that the reasoning abilities of small-scale LMs can be enhanced through self-training, a process where models learn from their own outputs. We also show that the conventional self-training can be further augmented by a preference learning algorithm called Direct Preference Optimization (DPO). By integrating DPO into self-training, we leverage preference data to guide LMs towards more accurate and diverse chain-of-thought reasoning. We evaluate our method across various mathematical reasoning tasks using different base models. Our experiments show that this approach not only improves LMs' reasoning performance but also offers a more cost-effective and scalable solution compared to relying on large proprietary LMs.
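As a rough illustration of the preference-learning step named above, the sketch below computes the standard DPO loss for a single (chosen, rejected) pair from sequence log-probabilities, and builds pairs from a model's own sampled rationales. The pairing rule (correct final answer as chosen, incorrect as rejected), the function names, and the default `beta` are assumptions made for illustration; the abstract does not specify how the preference data is constructed.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one preference pair.

    Inputs are summed log-probabilities of the full chosen/rejected
    sequences under the trained policy and the frozen reference model.
    beta=0.1 is an illustrative default, not a value from the paper.
    """
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def build_preference_pairs(samples, gold_answer):
    """Pair self-generated rationales by answer correctness.

    `samples` is a list of (rationale_text, final_answer) tuples sampled
    from the model itself. Assumed rule: rationales reaching the gold
    answer are "chosen", all others are "rejected".
    """
    correct = [text for text, answer in samples if answer == gold_answer]
    wrong = [text for text, answer in samples if answer != gold_answer]
    return [(c, w) for c in correct for w in wrong]


if __name__ == "__main__":
    samples = [("3 + 4 = 7", "7"), ("3 + 4 = 8", "8"), ("4 + 3 = 7", "7")]
    print(build_preference_pairs(samples, "7"))
    print(dpo_loss(-5.0, -9.0, -6.0, -8.0))
```

When the policy and the reference assign identical log-probabilities, the margin is zero and the loss equals log 2; the loss falls as the policy raises the chosen rationale's likelihood relative to the rejected one.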
