Scaling of Search and Learning: A Roadmap to Reproduce o1 from Reinforcement Learning Perspective

Abstract

OpenAI o1 represents a significant milestone in Artificial Intelligence, achieving expert-level performance on many challenging tasks that require strong reasoning ability. OpenAI has claimed that the main technique behind o1 is reinforcement learning. Recent works use alternative approaches such as knowledge distillation to imitate o1's reasoning style, but their effectiveness is limited by the capability ceiling of the teacher model. Therefore, this paper analyzes the roadmap to achieving o1 from the perspective of reinforcement learning, focusing on four key components: policy initialization, reward design, search, and learning. Policy initialization enables models to develop human-like reasoning behaviors, equipping them with the ability to effectively explore solution spaces for complex problems. Reward design provides dense and effective signals via reward shaping or reward modeling, which guide both search and learning. Search plays a crucial role in generating high-quality solutions during both training and testing phases, and can produce better solutions with more computation. Learning uses the data generated by search to improve the policy, and can achieve better performance with more parameters and more searched data. Existing open-source projects that attempt to reproduce o1 can be seen as a part or a variant of our roadmap. Collectively, these components underscore how learning and search drive o1's advancement, making meaningful contributions to the development of LLMs.
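
As a rough illustration of how reward, search, and learning data fit together, the sketch below uses best-of-N sampling, one simple search strategy, on a toy problem. It is not taken from the paper or from o1: `toy_policy`, `toy_reward`, `best_of_n`, and the target value 42 are hypothetical stand-ins for an LLM policy, a reward signal, and a search procedure. It only shows the idea that spending more computation on search (a larger N) cannot worsen the best solution found, and that the searched solutions can be collected as training data for the policy.

```python
import random


def toy_policy(rng):
    """Hypothetical stand-in for a policy: propose one candidate answer."""
    return rng.randint(0, 100)


def toy_reward(candidate, target=42):
    """Hypothetical reward: 0 is best, more negative is worse."""
    return -abs(candidate - target)


def best_of_n(n, seed=0):
    """Search: sample n candidates and keep the highest-reward one."""
    rng = random.Random(seed)
    candidates = [toy_policy(rng) for _ in range(n)]
    return max(candidates, key=toy_reward)


def collect_search_data(num_problems, n):
    """Gather (problem_id, best_solution) pairs that a learning step could use."""
    return [(pid, best_of_n(n, seed=pid)) for pid in range(num_problems)]


if __name__ == "__main__":
    # With a fixed seed, the samples for a small n are a prefix of those for a
    # larger n, so the best reward never decreases as n grows.
    for n in (1, 4, 16, 64):
        best = best_of_n(n)
        print(f"n={n:3d}  best={best:3d}  reward={toy_reward(best)}")
```

In a real system the policy would be a language model, the reward would come from reward shaping or a learned reward model, and the collected pairs would feed a policy-improvement step; those parts are omitted here.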
