Discovering Reinforcement Learning Algorithms

Abstract

Reinforcement learning (RL) algorithms update an agent's parameters according to one of several possible rules, discovered manually through years of research. Automating the discovery of update rules from data could lead to more efficient algorithms, or algorithms that are better adapted to specific environments. Although there have been prior attempts at addressing this significant scientific challenge, it remains an open question whether it is feasible to discover alternatives to fundamental concepts of RL such as value functions and temporal-difference learning. This paper introduces a new meta-learning approach that discovers an entire update rule which includes both 'what to predict' (e.g. value functions) and 'how to learn from it' (e.g. bootstrapping) by interacting with a set of environments. The output of this method is an RL algorithm that we call Learned Policy Gradient (LPG). Empirical results show that our method discovers its own alternative to the concept of value functions. Furthermore it discovers a bootstrapping mechanism to maintain and use its predictions. Surprisingly, when trained solely on toy environments, LPG generalises effectively to complex Atari games and achieves non-trivial performance. This shows the potential to discover general RL algorithms from data.
