Abstract
The reasoning abilities of large language models (LLMs) have improved with chain-of-thought (CoT) prompting, allowing models to solve complex tasks stepwise. However, training CoT capabilities requires detailed reasoning data, which is often scarce. The self-taught reasoner (STaR) framework addresses this by using reinforcement learning to automatically generate reasoning steps, reducing reliance on human-labeled data. Although STaR and its variants have demonstrated empirical success, a theoretical foundation explaining these improvements is lacking. This work provides a theoretical framework for understanding the effectiveness of reinforcement learning on CoT reasoning and STaR. Our contributions are: (1) criteria for the quality of pre-trained models necessary to initiate effective reasoning improvement; (2) an analysis of policy improvement, showing why LLM reasoning improves iteratively with STaR; (3) conditions for convergence to an optimal reasoning policy; and (4) an examination of STaR's robustness, explaining how it can improve reasoning even when incorporating occasional incorrect steps. This framework aims to bridge empirical findings with theoretical insights, advancing reinforcement learning approaches for reasoning in LLMs.