Accelerating Reinforcement Learning with Suboptimal Guidance

Abstract

Reinforcement learning in domains with sparse rewards is a difficult problem, and a large part of the training process is often spent searching the state space in a more or less random fashion for any learning signals. For control problems, we often have some controller readily available which might be suboptimal but nevertheless solves the problem to some degree. This controller can be used to guide the initial exploration phase of the learning controller towards reward-yielding states, reducing the time before refinement of a viable policy can be initiated. In our work, the agent is guided through an auxiliary behaviour cloning loss which is made conditional on a Q-filter, i.e. it is only applied in situations where the critic deems the guiding controller to be better than the agent. The Q-filter provides a natural way to adjust the guidance throughout the training process, allowing the agent to exceed the guiding controller in a manner that is adaptive to the task at hand and the proficiency of the guiding controller. The contribution of this paper lies in identifying shortcomings in previously proposed implementations of the Q-filter concept, and in suggesting some ways these issues can be mitigated. These modifications are tested on the OpenAI Gym Fetch environments, showing clear improvements in adaptivity and yielding increased performance in all robotic environments tested.
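
To make the Q-filter idea concrete, the following is a minimal sketch of a behaviour cloning loss that is only applied where the critic rates the guiding controller's action above the agent's own action. It illustrates the basic concept as described in the abstract only, not the modifications proposed in the paper, which are not detailed here. The function name, the argument names, the squared-error form of the cloning loss, and the normalisation by batch size are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def q_filtered_bc_loss(
    agent_actions: Sequence[Sequence[float]],
    guide_actions: Sequence[Sequence[float]],
    q_agent: Sequence[float],
    q_guide: Sequence[float],
) -> Tuple[float, List[int]]:
    """Hypothetical Q-filtered behaviour cloning loss over a batch.

    q_agent[i] and q_guide[i] are the critic's value estimates for the
    agent's action and the guiding controller's action in state i.
    """
    # Q-filter: 1 where the critic deems the guide better than the agent, else 0.
    mask = [1 if qg > qa else 0 for qa, qg in zip(q_agent, q_guide)]

    # Assumed cloning loss: squared distance between agent and guide actions.
    per_sample = [
        sum((a - g) ** 2 for a, g in zip(agent_act, guide_act))
        for agent_act, guide_act in zip(agent_actions, guide_actions)
    ]

    # Filtered-out samples contribute zero; averaging over the full batch
    # is an assumption of this sketch.
    loss = sum(m * e for m, e in zip(mask, per_sample)) / len(mask)
    return loss, mask


# Toy batch of three states with two-dimensional actions.
loss, mask = q_filtered_bc_loss(
    agent_actions=[[0.0, 0.0], [1.0, 1.0], [0.5, 0.5]],
    guide_actions=[[1.0, 0.0], [1.0, 1.0], [0.0, 0.0]],
    q_agent=[0.2, 0.5, 0.9],
    q_guide=[0.6, 0.4, 0.1],
)
print(mask, loss)
```

In this toy batch only the first state passes the filter, so the agent is pulled towards the guide there and left free to follow its own policy in the other two states. As the critic comes to prefer the agent's actions in more states, the mask switches the guidance off, which is the adaptive behaviour the abstract refers to.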
