RL-STaR: Theoretical Analysis of Reinforcement Learning Frameworks for Self-Taught Reasoner

Abstract

The reasoning abilities of large language models (LLMs) have improved with chain-of-thought (CoT) prompting, allowing models to solve complex tasks in a stepwise manner. However, training CoT capabilities requires detailed reasoning data, which is often scarce. The self-taught reasoner (STaR) framework addresses this by using reinforcement learning to automatically generate reasoning steps, reducing reliance on human-labeled data. Although STaR and its variants have demonstrated empirical success, a theoretical foundation explaining these improvements is lacking. This work provides a theoretical framework for understanding the effectiveness of reinforcement learning on CoT reasoning and STaR. Our contributions are: (1) an analysis of policy improvement, showing why LLM reasoning improves iteratively with STaR; (2) conditions for convergence to an optimal reasoning policy; (3) an examination of STaR's robustness, explaining how it can improve reasoning even when incorporating occasional incorrect steps; and (4) criteria for the quality of pre-trained models necessary to initiate effective reasoning improvement. This framework aims to bridge empirical findings with theoretical insights, advancing reinforcement learning approaches for reasoning in LLMs.
