Ctrl-Z: Recovering from Instability in Reinforcement Learning

Abstract

When learning behavior, training data is often generated by the learner itself; this can result in unstable training dynamics, and this problem has particularly important applications in safety-sensitive real-world control tasks such as robotics. In this work, we propose a principled and model-agnostic approach to mitigate the issue of unstable learning dynamics by maintaining a history of a reinforcement learning agent over the course of training, and reverting to the parameters of a previous agent whenever performance significantly decreases. We develop techniques for evaluating this performance through statistical hypothesis testing of continued improvement, and evaluate them on a standard suite of challenging benchmark tasks involving continuous control of simulated robots. We show improvements over state-of-the-art reinforcement learning algorithms in performance and robustness to hyperparameters, outperforming DDPG in 5 out of 6 evaluation environments and showing no decrease in performance with TD3, which is known to be relatively stable. In this way, our approach takes an important step towards increasing data efficiency and stability in training for real-world robotic applications.
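The rollback idea described in the abstract can be illustrated with a minimal sketch. Everything concrete here is an assumption for illustration, not the paper's method: the abstract does not say which hypothesis test is used, so this sketch substitutes a one-sided Welch-style test with a normal approximation; the names `significantly_worse` and `RollbackTrainer`, the `alpha` threshold, and the simplification of keeping a single best checkpoint instead of a full history are all hypothetical.

```python
import copy
import math
from statistics import NormalDist, mean, variance


def significantly_worse(recent, reference, alpha=0.05):
    """One-sided test of whether mean(recent) < mean(reference).

    Assumption: a Welch-style statistic with a normal approximation,
    chosen only to keep this sketch dependency-free.
    """
    if len(recent) < 2 or len(reference) < 2:
        return False  # too few samples to estimate a variance
    se = math.sqrt(variance(recent) / len(recent)
                   + variance(reference) / len(reference))
    if se == 0.0:
        # Degenerate case: no spread in either sample.
        return mean(recent) < mean(reference)
    z = (mean(recent) - mean(reference)) / se
    # Small left-tail probability means a significant decrease.
    return NormalDist().cdf(z) < alpha


class RollbackTrainer:
    """Hypothetical wrapper that reverts parameters on a significant drop."""

    def __init__(self, params, alpha=0.05):
        self.params = params
        self.alpha = alpha
        self.best_params = copy.deepcopy(params)
        self.best_returns = None  # evaluation returns of the checkpoint
        self.rollbacks = 0

    def report(self, returns):
        """Record evaluation returns; return True if a rollback happened."""
        if self.best_returns is None:
            # First evaluation: store it as the reference checkpoint.
            self.best_returns = list(returns)
            self.best_params = copy.deepcopy(self.params)
            return False
        if significantly_worse(returns, self.best_returns, self.alpha):
            # Performance dropped significantly: restore the checkpoint.
            self.params = copy.deepcopy(self.best_params)
            self.rollbacks += 1
            return True
        if mean(returns) >= mean(self.best_returns):
            # Continued improvement: advance the checkpoint.
            self.best_params = copy.deepcopy(self.params)
            self.best_returns = list(returns)
        return False
```

In a training loop, one would call `report` after each evaluation phase with the episode returns of the current agent and continue training from `trainer.params`. Because the wrapper only touches parameters and returns, it does not depend on the underlying learning algorithm, which matches the model-agnostic framing in the abstract.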
