Abstract
This paper investigates Reinforcement Learning (RL) on data without explicit labels for reasoning tasks in Large Language Models (LLMs). The core challenge is estimating rewards during inference without access to ground-truth information. While this setting appears elusive, we find that common practices in Test-Time Scaling (TTS), such as majority voting, yield surprisingly effective rewards suitable for driving RL training. In this work, we introduce Test-Time Reinforcement Learning (TTRL), a novel method for training LLMs using RL on unlabeled data. TTRL enables self-evolution of LLMs by utilizing the priors in the pre-trained models. Our experiments demonstrate that TTRL consistently improves performance across a variety of tasks and models. Notably, TTRL boosts the pass@1 performance of Qwen-2.5-Math-7B by approximately 159% on AIME 2024 with only unlabeled test data. Furthermore, although TTRL is supervised only by the Maj@N metric, it consistently surpasses the upper limit of the initial model and approaches the performance of models trained directly on test data with ground-truth labels. Our experimental findings validate the general effectiveness of TTRL across various tasks and highlight TTRL's potential for broader tasks and domains. GitHub: https://github.com/PRIME-RL/TTRL
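
To make the majority-voting reward concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the implementation from the repository: it assumes the final answers have already been extracted from N sampled responses as comparable strings, that the most frequent answer serves as the pseudo-label, and that each sample receives a binary reward for agreeing with it. The function name `majority_vote_rewards` is hypothetical.

```python
from collections import Counter


def majority_vote_rewards(answers):
    """Estimate rewards for sampled answers without ground-truth labels.

    `answers` holds the final answers extracted from N sampled responses
    to one prompt (the extraction step is assumed to happen elsewhere).
    Returns the majority answer, used as a pseudo-label, and one reward
    per sample.
    """
    if not answers:
        raise ValueError("need at least one sampled answer")

    # The most frequent answer stands in for the unknown ground truth.
    # Ties are broken by first occurrence, which is how Counter orders
    # elements with equal counts.
    pseudo_label, _ = Counter(answers).most_common(1)[0]

    # Binary reward (an assumption of this sketch): 1.0 when a sample
    # agrees with the pseudo-label, 0.0 otherwise.
    rewards = [1.0 if a == pseudo_label else 0.0 for a in answers]
    return pseudo_label, rewards


if __name__ == "__main__":
    label, rewards = majority_vote_rewards(["42", "41", "42", "7"])
    print(label, rewards)
```

In an RL loop these rewards would take the place of label-based rewards for the sampled responses. How they are consumed depends on the RL algorithm, which this sketch leaves out.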