Long-Term Visitation Value for Deep Exploration in Sparse Reward Reinforcement Learning

Abstract

Reinforcement learning with sparse rewards is still an open challenge. Classic methods rely on getting feedback via extrinsic rewards to train the agent, and in situations where this occurs very rarely the agent learns slowly or cannot learn at all. Similarly, if the agent also receives rewards that create suboptimal modes of the objective function, it will likely stop exploring prematurely. More recent methods add auxiliary intrinsic rewards to encourage exploration. However, auxiliary rewards lead to a non-stationary target for the Q-function. In this paper, we present a novel approach that (1) plans exploration actions far into the future by using a long-term visitation count, and (2) decouples exploration and exploitation by learning a separate function assessing the exploration value of the actions. Contrary to existing methods which use models of reward and dynamics, our approach is off-policy and model-free. We further propose new tabular environments for benchmarking exploration in reinforcement learning. Empirical results on classic and novel benchmarks show that the proposed approach outperforms existing methods in environments with sparse rewards, especially in the presence of rewards that create suboptimal modes of the objective function. Results also suggest that our approach scales gracefully with the size of the environment. Source code is available at https://github.com/sparisi/visit-value-explore
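
To make the two ideas in the abstract concrete, the following is a minimal tabular sketch, not the implementation from the linked repository. It keeps one table `Q` for extrinsic reward and a separate table `W` for exploration value, both updated off-policy from the same transitions. The chain environment, the count-based bonus `1 / sqrt(N)`, the optimistic initialization of `W`, the W-greedy behavior policy, and all names and hyperparameters (`run_chain`, `gamma_w`, and so on) are illustrative assumptions and are not taken from the paper.

```python
import numpy as np


def run_chain(n_states=10, episodes=50, horizon=30, alpha=0.5,
              gamma_q=0.99, gamma_w=0.9, seed=0):
    """Toy chain: start at state 0, reward 1 only on reaching the last state.

    Hypothetical sketch: Q tracks extrinsic reward, W tracks a discounted
    sum of count-based bonuses (a stand-in for a long-term visitation value).
    """
    rng = np.random.default_rng(seed)
    n_actions = 2  # 0 = left, 1 = right
    # Assumption: W starts at its upper bound 1 / (1 - gamma_w), so that
    # untried actions look maximally attractive to the exploration policy.
    w0 = 1.0 / (1.0 - gamma_w)
    Q = np.zeros((n_states, n_actions))       # exploitation value
    W = np.full((n_states, n_actions), w0)    # exploration value
    N = np.zeros((n_states, n_actions))       # state-action visit counts

    for _ in range(episodes):
        s = 0
        for _ in range(horizon):
            # Behavior policy (assumption): greedy on W, random tie-breaking.
            best = np.flatnonzero(W[s] == W[s].max())
            a = int(rng.choice(best))

            s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
            done = s_next == n_states - 1
            r = 1.0 if done else 0.0

            N[s, a] += 1
            bonus = 1.0 / np.sqrt(N[s, a])  # assumed count-based bonus in (0, 1]

            # Off-policy (max-based) updates. Q never sees the bonus, so its
            # target depends only on the extrinsic reward.
            q_target = r + (0.0 if done else gamma_q * Q[s_next].max())
            Q[s, a] += alpha * (q_target - Q[s, a])

            # W bootstraps through the next state, so low counts far away
            # still raise the exploration value of actions leading there.
            # Assumption: W keeps bootstrapping even at the goal state.
            w_target = bonus + gamma_w * W[s_next].max()
            W[s, a] += alpha * (w_target - W[s, a])

            s = s_next
            if done:
                break
    return Q, W, N
```

Calling `Q, W, N = run_chain()` returns the two value tables and the visit counts. Because the bonus enters only the `W` update, the target for `Q` is not altered by exploration bonuses, which is the decoupling the abstract describes; after learning, one would act greedily on `Q` alone.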
