Sample-Efficient Reinforcement Learning with Maximum Entropy Mellowmax Episodic Control

Abstract

Deep networks have enabled reinforcement learning to scale to more complexand challenging domains, but these methods typically require large quantitiesof training data. An alternative is to use sample-efficient episodic controlmethods: neuro-inspired algorithms which use non-/semi-parametric models thatpredict values based on storing and retrieving previously experiencedtransitions. One way to further improve the sample efficiency of theseapproaches is to use more principled exploration strategies. In this work, wetherefore propose maximum entropy mellowmax episodic control (MEMEC), whichsamples actions according to a Boltzmann policy with a state-dependenttemperature. We demonstrate that MEMEC outperforms other uncertainty- andsoftmax-based exploration methods on classic reinforcement learningenvironments and Atari games, achieving both more rapid learning and higherfinal rewards.

Quick Read (beta)

loading the full paper ...