DEFT: Diverse Ensembles for Fast Transfer in Reinforcement Learning

Abstract

Deep ensembles have been shown to extend the positive effect seen in typical ensemble learning to neural networks and to reinforcement learning (RL). However, there is still much to be done to improve the efficiency of such ensemble models. In this work, we present Diverse Ensembles for Fast Transfer in RL (DEFT), a new ensemble-based method for reinforcement learning in highly multimodal environments and improved transfer to unseen environments. The algorithm is broken down into two main phases: training of ensemble members, and synthesis (or fine-tuning) of the ensemble members into a policy that works in a new environment.

The first phase of the algorithm involves training regular policy gradient or actor-critic agents in parallel but adding a term to the loss that encourages these policies to differ from each other. This causes the individual unimodal agents to explore the space of optimal policies and capture more of the multimodality of the environment than a single actor could. The second phase of DEFT involves synthesizing the component policies into a new policy that works well in a modified environment in one of two ways. To evaluate the performance of DEFT, we start with a base version of the Proximal Policy Optimization (PPO) algorithm and extend it with the modifications for DEFT. Our results show that the pretraining phase is effective in producing diverse policies in multimodal environments. DEFT often converges to a high reward significantly faster than alternatives, such as random initialization without DEFT and fine-tuning of ensemble members.

While there is certainly more work to be done to analyze DEFT theoretically and extend it to be even more robust, we believe it provides a strong framework for capturing multimodality in environments while still using RL methods with simple policy representations.
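
The abstract does not state the exact form of the loss term that encourages ensemble members to differ. As a rough illustration only, the sketch below assumes discrete action distributions evaluated at one shared state and uses the mean pairwise KL divergence as the diversity measure. The function names, the choice of KL divergence, and the `diversity_weight` parameter are all hypothetical and are not taken from the paper.

```python
import math

def kl_divergence(p, q, eps=1e-8):
    """KL(p || q) for two discrete action distributions (lists of probabilities)."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def mean_pairwise_kl(action_probs):
    """Average KL divergence over all ordered pairs of distinct ensemble members.

    action_probs[i] is member i's action distribution at one shared state.
    """
    n = len(action_probs)
    if n < 2:
        return 0.0
    total = sum(
        kl_divergence(action_probs[i], action_probs[j])
        for i in range(n)
        for j in range(n)
        if i != j
    )
    return total / (n * (n - 1))

def diversity_regularized_loss(policy_losses, action_probs, diversity_weight=0.1):
    """Sum of per-member policy losses minus a weighted diversity bonus.

    Assumption: the bonus is subtracted, so minimizing the total loss
    pushes the members' action distributions apart.
    """
    return sum(policy_losses) - diversity_weight * mean_pairwise_kl(action_probs)
```

In this sketch, `policy_losses` stands in for whatever per-member objective the base algorithm produces (for example, a PPO surrogate loss). Members that agree on every action receive no bonus, while members that disagree lower the combined loss.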
