Abstract
Proximal Policy Optimization (PPO) is widely regarded as one of the most successful deep reinforcement learning algorithms, known for its robustness and effectiveness across a range of problems. The PPO objective encourages the importance ratio between the current and behavior policies to move in the "right" direction: starting from an importance sampling ratio of 1, it increases the ratio for actions with positive advantages and decreases it for actions with negative advantages. A clipping function is introduced to prevent over-optimization when updating the importance ratio in these "right" direction regions. Many PPO variants have been proposed to extend its success, most of which modify the objective's behavior by altering the clipping in the "right" direction regions. However, due to randomness in the rollouts and stochasticity of the policy optimization, we observe that the ratios frequently move in the "wrong" direction during PPO optimization. This is a key factor hindering the improvement of PPO, but it has been largely overlooked. To address this, we propose the Directional-Clamp PPO algorithm (DClamp-PPO), which further penalizes actions that move into the strict "wrong" direction regions, where the advantage is positive and the importance ratio falls below $1 - \beta$, or the advantage is negative and the ratio rises above $1 + \beta$, for a tunable parameter $\beta \in (0, 1)$. The penalty is applied by enforcing a steeper loss slope, i.e., a clamp, in those regions. We demonstrate that DClamp-PPO consistently outperforms PPO, as well as variants that focus on modifying the objective's behavior in the "right" direction, across various MuJoCo environments and random seeds. The proposed method is shown, both theoretically and empirically, to better avoid "wrong" direction updates while keeping the importance ratio closer to 1.
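To make the idea of a steeper slope in the strict "wrong" direction regions concrete, the following per-sample sketch contrasts the standard PPO clipped surrogate with one possible directional clamp. It is an illustration under stated assumptions, not the exact DClamp-PPO objective: the slope multiplier `alpha`, the clip range `eps`, the function names, and the piecewise-linear form that stays continuous at the region edge are all hypothetical choices, since the abstract specifies only the regions and that the slope is steeper there.

```python
def ppo_clip_objective(ratio, advantage, eps=0.2):
    """Standard PPO clipped surrogate for one sample (to be maximized)."""
    clipped = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped * advantage)


def dclamp_objective(ratio, advantage, eps=0.2, beta=0.1, alpha=2.0):
    """Hypothetical directional-clamp surrogate for one sample.

    Assumption: inside the strict "wrong" direction regions the slope with
    respect to the ratio is multiplied by alpha > 1. The exact form used by
    DClamp-PPO is not given in the abstract.
    """
    if advantage > 0 and ratio < 1.0 - beta:
        edge = 1.0 - beta  # positive advantage, ratio fell below 1 - beta
    elif advantage < 0 and ratio > 1.0 + beta:
        edge = 1.0 + beta  # negative advantage, ratio rose above 1 + beta
    else:
        # Everywhere else, fall back to the usual PPO clipped surrogate.
        return ppo_clip_objective(ratio, advantage, eps)
    # Match PPO at the region edge, then decrease alpha times faster beyond it.
    return advantage * edge + alpha * advantage * (ratio - edge)


if __name__ == "__main__":
    for r, a in [(1.0, 1.0), (0.8, 1.0), (1.3, -1.0), (1.5, 1.0)]:
        print(r, a, ppo_clip_objective(r, a), dclamp_objective(r, a))
```

In this sketch the two objectives agree outside the strict "wrong" direction regions, so the familiar clipping behavior in the "right" direction is untouched. Inside those regions the clamped objective is lower than the PPO one and its gradient with respect to the ratio is `alpha` times larger, which pushes the ratio back toward 1 more strongly.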