Data Efficient Training for Reinforcement Learning with Adaptive Behavior Policy Sharing

Abstract

Deep Reinforcement Learning (RL) has proven powerful for decision making in simulated environments. However, training a deep RL model is challenging in real-world applications such as production-scale health-care or recommender systems, because interaction is expensive and the budget at deployment is limited. One source of this data inefficiency is the expensive hyper-parameter tuning required when optimizing deep neural networks. We propose Adaptive Behavior Policy Sharing (ABPS), a data-efficient training algorithm that allows sharing of experience collected by a behavior policy that is adaptively selected from a pool of agents trained with an ensemble of hyper-parameters. We further extend ABPS to evolve hyper-parameters during training by hybridizing ABPS with an adapted version of Population Based Training (ABPS-PBT). We conduct experiments on multiple Atari games with up to 16 hyper-parameter/architecture setups. ABPS achieves superior overall performance, reduced variance on the top 25% of agents, and equivalent performance on the best agent compared to conventional hyper-parameter tuning with independent training, even though ABPS only requires the same number of environmental interactions as training a single agent. We also show that ABPS-PBT further improves the convergence speed and reduces the variance.
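
The experience-sharing idea described in the abstract can be illustrated with a minimal sketch. Everything below is a toy under stated assumptions rather than the paper's implementation: ToyAgent, its skill parameter, and the epsilon-greedy selection rule in select_behavior_agent are hypothetical stand-ins, since the abstract does not say how the behavior policy is selected. The sketch only shows that one agent interacts with the environment per episode while every agent in the pool learns from the shared buffer.

```python
import random
from collections import deque


class ToyAgent:
    """Hypothetical stand-in for one off-policy learner in the pool."""

    def __init__(self, name, skill):
        self.name = name
        self.skill = skill              # assumption: proxy for hyper-parameter quality
        self.updates = 0                # number of learning steps taken
        self.returns = deque(maxlen=5)  # recent returns observed when acting

    def collect_episode(self, rng, length=4):
        # Toy rollout: (state, action, reward) with reward noisy around skill.
        return [(t, 0, self.skill + rng.uniform(-0.1, 0.1)) for t in range(length)]

    def learn(self, batch):
        # A real agent would take an off-policy gradient step on the batch.
        self.updates += 1


def select_behavior_agent(agents, rng, epsilon=0.2):
    """Assumed selection rule: epsilon-greedy on mean recent return."""
    untried = [a for a in agents if not a.returns]
    if untried:
        return untried[0]
    if rng.random() < epsilon:
        return rng.choice(agents)
    return max(agents, key=lambda a: sum(a.returns) / len(a.returns))


def train(agents, episodes, seed=0, batch_size=4):
    rng = random.Random(seed)
    shared_buffer = deque(maxlen=1000)
    env_steps = 0
    for _ in range(episodes):
        # Only the selected agent interacts with the environment.
        behavior = select_behavior_agent(agents, rng)
        episode = behavior.collect_episode(rng)
        env_steps += len(episode)
        behavior.returns.append(sum(r for _, _, r in episode))
        shared_buffer.extend(episode)
        # Every agent learns from the same shared experience.
        for agent in agents:
            k = min(batch_size, len(shared_buffer))
            agent.learn(rng.sample(list(shared_buffer), k))
    return env_steps, len(shared_buffer)
```

In this toy, training a pool of three agents for 20 episodes of 4 steps consumes 80 environment steps, the same as a single agent would, while each agent still receives 20 updates.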
