How Ensembles of Distilled Policies Improve Generalisation in Reinforcement Learning

Abstract

In the zero-shot policy transfer setting in reinforcement learning, the goal is to train an agent on a fixed set of training environments so that it can generalise to similar, but unseen, testing environments. Previous work has shown that policy distillation after training can sometimes produce a policy that outperforms the original in the testing environments. However, it is not yet entirely clear why that is, or what data should be used to distil the policy. In this paper, we prove, under certain assumptions, a generalisation bound for policy distillation after training. The theory provides two practical insights: for improved generalisation, you should 1) train an ensemble of distilled policies, and 2) distil it on as much data from the training environments as possible. We empirically verify that these insights hold in more general settings, when the assumptions required for the theory no longer hold. Finally, we demonstrate that an ensemble of policies distilled on a diverse dataset can generalise significantly better than the original agent.
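To make the first insight concrete, the following is a minimal sketch of distilling a teacher policy into several students and averaging their action probabilities. It is an illustration under stated assumptions, not the paper's method: the linear-softmax teacher and students, the synthetic Gaussian states, and the names `distil_student` and `ensemble_policy` are all hypothetical.

```python
import numpy as np


def softmax(logits):
    # Numerically stable softmax over the last axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def distil_student(states, teacher_probs, seed, steps=200, lr=0.5):
    """Fit a linear-softmax student to the teacher's action distribution.

    Hypothetical helper: minimises the mean cross-entropy between teacher
    and student action probabilities with plain gradient descent.
    """
    rng = np.random.default_rng(seed)
    n_features, n_actions = states.shape[1], teacher_probs.shape[1]
    # Each student starts from a different random initialisation.
    W = 0.1 * rng.standard_normal((n_features, n_actions))
    for _ in range(steps):
        probs = softmax(states @ W)
        # Gradient of the cross-entropy w.r.t. the logits is (probs - targets).
        grad = states.T @ (probs - teacher_probs) / len(states)
        W -= lr * grad
    return W


def ensemble_policy(weights, states):
    """Average the action probabilities of all distilled students."""
    return np.mean([softmax(states @ W) for W in weights], axis=0)


# Synthetic stand-in for states collected in the training environments.
rng = np.random.default_rng(0)
states = rng.standard_normal((256, 4))

# Assumed teacher: a fixed linear-softmax policy over 3 actions.
teacher_W = rng.standard_normal((4, 3))
teacher_probs = softmax(states @ teacher_W)

# Distil five students that differ only in their initialisation seed.
students = [distil_student(states, teacher_probs, seed=s) for s in range(5)]
ens_probs = ensemble_policy(students, states)

# Fraction of states where the ensemble picks the teacher's greedy action.
agreement = np.mean(ens_probs.argmax(axis=1) == teacher_probs.argmax(axis=1))
```

In this toy setting the second insight would correspond to enlarging `states` with more data from the training environments before distilling; the sketch makes no claim about how much that helps.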
