Abstract
We introduce Proximal Policy Distillation (PPD), a novel policy distillation method that integrates student-driven distillation and Proximal Policy Optimization (PPO) to increase sample efficiency and to leverage the additional rewards that the student policy collects during distillation. To assess the efficacy of our method, we compare PPD with two common alternatives, student-distill and teacher-distill, over a wide range of reinforcement learning environments that include discrete actions and continuous control (Atari, MuJoCo, and Procgen). For each environment and method, we perform distillation to a set of target student neural networks that are smaller, identical (self-distillation), or larger than the teacher network. Our findings indicate that PPD improves sample efficiency and produces better student policies compared to typical policy distillation approaches. Moreover, PPD demonstrates greater robustness than alternative methods when distilling policies from imperfect demonstrations. The code for the paper is released as part of a new Python library built on top of stable-baselines3 to facilitate policy distillation: "sb3-distill".