Learning Rewards to Optimize Global Performance Metrics in Deep Reinforcement Learning

Abstract

When applying reinforcement learning (RL) to a new problem, reward engineering is a necessary, but often difficult and error-prone task a system designer has to face. To avoid this step, we propose LR4GPM, a novel (deep) RL method that can optimize a global performance metric, which is supposed to be available as part of the problem description. LR4GPM alternates between two phases: (1) learning a (possibly vector) reward function used to fit the performance metric, and (2) training a policy to optimize an approximation of this performance metric based on the learned rewards. Such RL training is not straightforward since both the reward function and the policy are trained using non-stationary data. To overcome this issue, we propose several training tricks. We demonstrate the efficiency of LR4GPM on several domains. Notably, LR4GPM outperforms the winner of a recent autonomous driving competition organized at DAI'2020.
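
The two-phase alternation described in the abstract can be illustrated with a small toy sketch. This is not the LR4GPM implementation: it assumes a tabular setting with three actions, a scalar per-action reward table in place of a deep (possibly vector) reward network, a softmax policy updated with an exact gradient, and none of the training tricks the abstract mentions. The names `HIDDEN_VALUES`, `global_metric`, `fit_rewards`, `improve_policy`, and `train` are all hypothetical, introduced only for illustration.

```python
import math
import random

# Hypothetical per-action values, unknown to the learner. They exist only
# so the toy global metric below can score a whole episode.
HIDDEN_VALUES = [0.1, 0.5, 1.0]


def global_metric(actions):
    """Toy episode-level metric: the learner only sees this single number."""
    return sum(HIDDEN_VALUES[a] for a in actions)


def softmax(theta):
    """Numerically stable softmax over policy logits."""
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    z = sum(exps)
    return [e / z for e in exps]


def sample_episode(theta, length, rng):
    """Sample a fixed-length action sequence from the softmax policy."""
    probs = softmax(theta)
    return [rng.choices(range(len(theta)), weights=probs)[0] for _ in range(length)]


def fit_rewards(w, episodes, lr=0.01, epochs=20):
    """Phase 1 (sketch): fit per-action rewards so that their sum over an
    episode matches the observed global metric (SGD on squared error)."""
    w = list(w)
    for _ in range(epochs):
        for actions, metric in episodes:
            counts = [actions.count(a) for a in range(len(w))]
            error = sum(wa * c for wa, c in zip(w, counts)) - metric
            w = [wa - lr * error * c for wa, c in zip(w, counts)]
    return w


def improve_policy(theta, w, lr=0.5):
    """Phase 2 (sketch): one gradient-ascent step on the expected learned
    per-step reward under the softmax policy."""
    probs = softmax(theta)
    baseline = sum(p * r for p, r in zip(probs, w))
    return [t + lr * p * (r - baseline) for t, p, r in zip(theta, probs, w)]


def train(iterations=200, episodes_per_iter=10, length=5, seed=0):
    """Alternate between reward fitting and policy improvement."""
    rng = random.Random(seed)
    theta = [0.0] * len(HIDDEN_VALUES)  # policy logits
    w = [0.0] * len(HIDDEN_VALUES)      # learned per-action rewards
    for _ in range(iterations):
        # Data comes from the current policy, so its distribution drifts
        # as the policy changes (the non-stationarity noted in the abstract).
        batch = []
        for _ in range(episodes_per_iter):
            actions = sample_episode(theta, length, rng)
            batch.append((actions, global_metric(actions)))
        w = fit_rewards(w, batch)
        theta = improve_policy(theta, w)
    return theta, w
```

Calling `train()` returns the final policy logits and the learned reward table. In this toy, the metric is exactly a sum of per-step terms, so the learned rewards can recover it; a real global performance metric generally need not decompose this way, which is why the abstract speaks of optimizing an approximation of it.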
