Abstract
Deep Reinforcement Learning (RL) has proven powerful for decision making in simulated environments. However, training deep RL models is challenging in real-world applications such as production-scale health-care or recommender systems, because interaction is expensive and the budget at deployment is limited. One source of this data inefficiency is the expensive hyper-parameter tuning required when optimizing deep neural networks. We propose Adaptive Behavior Policy Sharing (ABPS), a data-efficient training algorithm that allows sharing of experience collected by a behavior policy that is adaptively selected from a pool of agents trained with an ensemble of hyper-parameters. We further extend ABPS to evolve hyper-parameters during training by hybridizing ABPS with an adapted version of Population Based Training (ABPS-PBT). We conduct experiments on multiple Atari games with up to 16 hyper-parameter/architecture setups. Compared to conventional hyper-parameter tuning with independent training, ABPS achieves superior overall performance, reduced variance on the top 25% of agents, and equivalent performance on the best agent, even though ABPS requires only the same number of environmental interactions as training a single agent. We also show that ABPS-PBT further improves the convergence speed and reduces the variance.
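The core idea of sharing experience from an adaptively selected behavior policy can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the method itself: the toy `ChainEnv` and the tabular `QAgent` stand in for Atari and deep RL agents, the pool varies only the learning rate, and the behavior policy is chosen by an epsilon-greedy rule over each agent's running mean episode return, which is only one plausible selection rule. All names (`ChainEnv`, `QAgent`, `run_abps_sketch`, `select_eps`) are hypothetical.

```python
import random


class ChainEnv:
    """Toy chain MDP (illustrative stand-in for an Atari game).
    Action 1 moves right, action 0 moves left; reward 1 at the right end."""

    def __init__(self, n_states=5, max_steps=20):
        self.n_states, self.max_steps = n_states, max_steps

    def reset(self):
        self.state, self.t = 0, 0
        return self.state

    def step(self, action):
        self.t += 1
        if action == 1:
            self.state = min(self.state + 1, self.n_states - 1)
        else:
            self.state = max(self.state - 1, 0)
        at_goal = self.state == self.n_states - 1
        done = at_goal or self.t >= self.max_steps
        return self.state, (1.0 if at_goal else 0.0), done


class QAgent:
    """Tabular Q-learner; the learning rate is the tuned hyper-parameter."""

    def __init__(self, n_states, lr, gamma=0.9, eps=0.2):
        self.q = [[0.0, 0.0] for _ in range(n_states)]
        self.lr, self.gamma, self.eps = lr, gamma, eps

    def act(self, s, rng):
        # Explore with probability eps, and break ties randomly.
        if rng.random() < self.eps or self.q[s][0] == self.q[s][1]:
            return rng.randrange(2)
        return 0 if self.q[s][0] > self.q[s][1] else 1

    def update(self, batch):
        # Off-policy Q-learning update, so data from any behavior agent is usable.
        for s, a, r, s2, done in batch:
            target = r if done else r + self.gamma * max(self.q[s2])
            self.q[s][a] += self.lr * (target - self.q[s][a])


def run_abps_sketch(learning_rates, episodes=200, select_eps=0.2,
                    batch_size=32, seed=0):
    rng = random.Random(seed)
    env = ChainEnv()
    agents = [QAgent(env.n_states, lr) for lr in learning_rates]
    shared_buffer = []                       # experience shared by all agents
    mean_return = [0.0] * len(agents)        # running mean return per agent
    counts = [0] * len(agents)               # episodes each agent acted in
    env_steps = 0
    for _ in range(episodes):
        # ASSUMPTION: epsilon-greedy selection of the behavior agent.
        if rng.random() < select_eps:
            k = rng.randrange(len(agents))
        else:
            k = max(range(len(agents)), key=lambda i: mean_return[i])
        s, done, ep_return = env.reset(), False, 0.0
        while not done:                      # only agent k touches the env
            a = agents[k].act(s, rng)
            s2, r, done = env.step(a)
            shared_buffer.append((s, a, r, s2, done))
            s, ep_return, env_steps = s2, ep_return + r, env_steps + 1
        counts[k] += 1
        mean_return[k] += (ep_return - mean_return[k]) / counts[k]
        # Every agent in the pool learns from the same shared experience.
        for agent in agents:
            n = min(batch_size, len(shared_buffer))
            agent.update(rng.sample(shared_buffer, n))
    return agents, shared_buffer, counts, env_steps


agents, buffer, counts, env_steps = run_abps_sketch([0.05, 0.2, 0.5])
print("episodes per behavior agent:", counts, "| env steps:", env_steps)
```

In this sketch the number of environment steps depends only on the number of episodes, not on the size of the pool, which mirrors the claim that the whole ensemble is trained with the interaction budget of a single agent.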