Abstract
This paper introduces two novel modifications to the Dynamic sAmpling Policy Optimization (DAPO) algorithm [1], approached from a mixed-policy perspective. Standard policy gradient methods can suffer from instability and sample inefficiency, particularly in sparse reward settings. To address this, we first propose a method that incorporates a pre-trained, stable guiding policy ($\pi_{\phi}$) to provide off-policy experience, thereby regularizing the training of the target policy ($\pi_{\mathrm{on}}$). This approach improves training stability and convergence speed by adaptively adjusting the learning step size. Secondly, we extend this idea to re-utilize zero-reward samples, which are often discarded by dynamic sampling strategies like DAPO's. By treating these samples as a distinct batch guided by the expert policy, we further enhance sample efficiency. We provide a theoretical analysis for both methods, demonstrating that their objective functions converge to the optimal solution within the established theoretical framework of reinforcement learning. The proposed mixed-policy framework effectively balances exploration and exploitation, promising more stable and efficient policy optimization.