Abstract
Recent advancements in post-training methodologies for large language models (LLMs) have highlighted reinforcement learning (RL) as a critical component for enhancing reasoning. However, the substantial computational costs associated with RL-based approaches have led to growing interest in alternative paradigms, such as Direct Preference Optimization (DPO). In this study, we investigate the effectiveness of DPO in facilitating self-improvement for LLMs through iterative preference-based learning. We demonstrate that a single round of DPO with coarse filtering significantly enhances mathematical reasoning performance, particularly for strong base models. Furthermore, we design an iterative enhancement framework for both the generator and the reward model (RM), enabling their mutual improvement through online interaction across multiple rounds of DPO. Finally, with simple verifiable rewards, our model DPO-VP achieves RL-level performance with significantly lower computational overhead. These findings highlight DPO as a scalable and cost-effective alternative to RL, offering a practical solution for enhancing LLM reasoning in resource-constrained situations.
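As a rough illustration of the ingredients named above, the following minimal sketch shows how a verifiable reward could turn sampled responses into preference pairs, together with the standard per-pair DPO loss. It is not the paper's implementation: the exact-match reward, the all-correct-versus-all-incorrect pairing scheme, the function names, and the value `beta=0.1` are all assumptions made for illustration.

```python
import math
from itertools import product


def verifiable_reward(final_answer: str, reference_answer: str) -> float:
    """Hypothetical verifiable reward: 1.0 on exact match of the final answer, else 0.0."""
    return 1.0 if final_answer.strip() == reference_answer.strip() else 0.0


def build_preference_pairs(samples, reference_answer):
    """Pair every correct response with every incorrect one (an assumed scheme).

    samples: list of (response_text, final_answer) tuples sampled for one prompt.
    Returns a list of (chosen_text, rejected_text) tuples.
    """
    correct = [text for text, ans in samples
               if verifiable_reward(ans, reference_answer) == 1.0]
    wrong = [text for text, ans in samples
             if verifiable_reward(ans, reference_answer) == 0.0]
    return list(product(correct, wrong))


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one pair: -log sigmoid(beta * (chosen_gap - rejected_gap)).

    Each argument is a summed sequence log-probability under the policy
    (logp_*) or the frozen reference model (ref_logp_*).
    """
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    z = -margin
    # Numerically stable softplus(z) = log(1 + exp(z)) = -log(sigmoid(margin)).
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))
```

In an iterative setting, pairs built this way from the current generator's samples would feed one round of DPO, after which the updated generator is sampled again. When the policy equals the reference model, the margin is zero and the loss is log 2 for every pair.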