Meta-Gradient Reinforcement Learning

Abstract

The goal of reinforcement learning algorithms is to estimate and/or optimise the value function. However, unlike supervised learning, no teacher or oracle is available to provide the true value function. Instead, the majority of reinforcement learning algorithms estimate and/or optimise a proxy for the value function. This proxy is typically based on a sampled and bootstrapped approximation to the true value function, known as a return. The particular choice of return is one of the chief components determining the nature of the algorithm: the rate at which future rewards are discounted; when and how values should be bootstrapped; or even the nature of the rewards themselves. It is well-known that these decisions are crucial to the overall success of RL algorithms. We discuss a gradient-based meta-learning algorithm that is able to adapt the nature of the return, online, whilst interacting and learning from the environment. When applied to 57 games on the Atari 2600 environment over 200 million frames, our algorithm achieved a new state-of-the-art performance.
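
To make the notion of a "sampled and bootstrapped" return concrete, the following is a minimal sketch of one common family of returns, the λ-return, in which a discount factor and a bootstrapping parameter are explicit knobs. The abstract does not specify this form; the function name `lambda_returns` and its arguments are illustrative assumptions, and the sketch shows only how the two parameters shape the learning target, not how the meta-gradient algorithm adapts them.

```python
def lambda_returns(rewards, next_values, gamma, lam):
    """Compute lambda-returns for one trajectory segment (illustrative sketch).

    Assumed conventions (not taken from the abstract):
      rewards[t]     is the reward received after step t,
      next_values[t] is the value estimate of the state reached after step t,
      gamma          is the discount factor in [0, 1],
      lam            is the bootstrapping parameter in [0, 1].

    Recursion, evaluated backwards in time:
      G_t = r_t + gamma * ((1 - lam) * v_{t+1} + lam * G_{t+1})
    with the final step bootstrapping fully from the last value estimate.
    """
    returns = [0.0] * len(rewards)
    next_return = next_values[-1]  # bootstrap at the end of the segment
    for t in reversed(range(len(rewards))):
        next_return = rewards[t] + gamma * (
            (1.0 - lam) * next_values[t] + lam * next_return
        )
        returns[t] = next_return
    return returns


# lam = 1 relies on sampled rewards; lam = 0 bootstraps after a single step.
sampled = lambda_returns([1.0, 1.0, 1.0], [0.0, 0.0, 0.0], gamma=0.5, lam=1.0)
one_step = lambda_returns([1.0, 1.0, 1.0], [2.0, 2.0, 2.0], gamma=0.5, lam=0.0)
```

In this sketch, changing `gamma` alters how quickly future rewards are discounted, and changing `lam` alters when values are bootstrapped; these are the kinds of decisions the abstract describes as determining the nature of the algorithm.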
