Reinforcement Pre-Training

Abstract

In this work, we introduce Reinforcement Pre-Training (RPT) as a new scaling paradigm for large language models and reinforcement learning (RL). Specifically, we reframe next-token prediction as a reasoning task trained using RL, where it receives verifiable rewards for correctly predicting the next token for a given context. RPT offers a scalable method to leverage vast amounts of text data for general-purpose RL, rather than relying on domain-specific annotated answers. By incentivizing the capability of next-token reasoning, RPT significantly improves the language modeling accuracy of predicting the next tokens. Moreover, RPT provides a strong pre-trained foundation for further reinforcement fine-tuning. The scaling curves show that increased training compute consistently improves the next-token prediction accuracy. The results position RPT as an effective and promising scaling paradigm to advance language model pre-training.
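
To make the idea of a verifiable next-token reward concrete, the following is a minimal sketch and not the authors' implementation. It assumes a binary exact-match reward (1 when the predicted token equals the ground-truth next token in the corpus, 0 otherwise); the abstract does not give the exact reward definition, so this is an illustrative choice. The names `Policy`, `next_token_reward`, `rollout_rewards`, and `repeat_last_token` are hypothetical, and the toy policy stands in for a language model that would reason before committing to a prediction.

```python
from typing import Callable, List, Sequence

# Hypothetical interface: a policy maps a context (the tokens seen so far)
# to one predicted next token. A real model would reason first; omitted here.
Policy = Callable[[Sequence[str]], str]


def next_token_reward(predicted: str, target: str) -> float:
    """Assumed binary reward: 1.0 on an exact match with the corpus token."""
    return 1.0 if predicted == target else 0.0


def rollout_rewards(
    tokens: Sequence[str], policy: Policy, num_rollouts: int = 4
) -> List[List[float]]:
    """Score `num_rollouts` predictions at every position of a token sequence.

    The ground truth comes from the text itself, so no annotation is needed.
    """
    rewards = []
    for t in range(1, len(tokens)):
        context, target = tokens[:t], tokens[t]
        rewards.append(
            [next_token_reward(policy(context), target) for _ in range(num_rollouts)]
        )
    return rewards


def repeat_last_token(context: Sequence[str]) -> str:
    """Toy deterministic policy: predict that the last token repeats."""
    return context[-1]


corpus = ["a", "a", "b", "b"]
rewards = rollout_rewards(corpus, repeat_last_token, num_rollouts=2)
# Mean reward over all rollouts equals next-token prediction accuracy.
accuracy = sum(map(sum, rewards)) / sum(map(len, rewards))
```

In an actual RL setup, per-rollout rewards like these would feed a policy-gradient update; the choice of update rule is outside what the abstract specifies.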
