Abstract
The reasoning abilities of large language models (LLMs) have improved with chain-of-thought (CoT) prompting, allowing models to solve complex tasks in a stepwise manner. However, training CoT capabilities requires detailed reasoning data, which is often scarce. The self-taught reasoner (STaR) framework addresses this by using reinforcement learning to automatically generate reasoning steps, reducing reliance on human-labeled data. Although STaR and its variants have demonstrated empirical success, a theoretical foundation explaining these improvements is lacking. This work provides a theoretical framework for understanding the effectiveness of reinforcement learning on CoT reasoning and STaR. Our contributions are: (1) an analysis of policy improvement, showing why LLM reasoning improves iteratively with STaR; (2) conditions for convergence to an optimal reasoning policy; (3) an examination of STaR's robustness, explaining how it can improve reasoning even when incorporating occasional incorrect steps; and (4) criteria for the quality of pre-trained models necessary to initiate effective reasoning improvement. This framework aims to bridge empirical findings with theoretical insights, advancing reinforcement learning approaches for reasoning in LLMs.