Cyclic Policy Distillation: Sample-Efficient Sim-to-Real Reinforcement Learning with Domain Randomization

Abstract

Deep reinforcement learning with domain randomization learns a control policy in various simulations with randomized physical and sensor model parameters to become transferable to the real world in a zero-shot setting. However, a huge number of samples are often required to learn an effective policy when the range of randomized parameters is extensive due to the instability of policy updates. To alleviate this problem, we propose a sample-efficient method named Cyclic Policy Distillation (CPD). CPD divides the range of randomized parameters into several small sub-domains and assigns a local policy to each sub-domain. Then, the learning of local policies is performed while cyclically transitioning the target sub-domain to neighboring sub-domains and exploiting the learned values/policies of the neighbor sub-domains with a monotonic policy-improvement scheme. Finally, all of the learned local policies are distilled into a global policy for sim-to-real transfer. The effectiveness and sample efficiency of CPD are demonstrated through simulations with four tasks (Pendulum from OpenAI Gym and Pusher, Swimmer, and HalfCheetah from MuJoCo), and a real-robot ball-dispersal task.
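The sub-domain partitioning and the neighbor-to-neighbor transitions described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names split_range and cyclic_schedule, the equal-width split, and the back-and-forth sweep order are all assumptions made for illustration, and the policy learning and distillation steps are omitted entirely.

```python
from typing import List, Tuple


def split_range(low: float, high: float, n: int) -> List[Tuple[float, float]]:
    """Split [low, high] into n contiguous, equal-width sub-domains.

    Assumption: equal widths; the abstract only says "several small sub-domains".
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    width = (high - low) / n
    return [(low + i * width, low + (i + 1) * width) for i in range(n)]


def cyclic_schedule(n: int, steps: int) -> List[int]:
    """Return `steps` target sub-domain indices, sweeping 0..n-1 and back.

    Assumption: a back-and-forth sweep, chosen so that every transition
    moves to a neighboring sub-domain.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    if n == 1:
        return [0] * steps
    period = 2 * (n - 1)
    schedule = []
    for t in range(steps):
        k = t % period
        # First half of the period sweeps upward, second half sweeps back down.
        schedule.append(k if k < n else period - k)
    return schedule


if __name__ == "__main__":
    # Hypothetical randomized parameter, e.g. a friction coefficient in [0, 1].
    sub_domains = split_range(0.0, 1.0, 4)
    for idx in cyclic_schedule(len(sub_domains), 8):
        print(idx, sub_domains[idx])
```

In a full training loop, each index in the schedule would select which local policy to update, with the previously visited neighbor supplying the learned values or policy to exploit; how that exploitation works is not specified in the abstract and is left out here.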
