Generalizing Across Multi-Objective Reward Functions in Deep Reinforcement Learning

Abstract

Many reinforcement-learning researchers treat the reward function as a part of the environment, meaning that the agent can only know the reward of a state if it encounters that state in a trial run. However, we argue that this is an unnecessary limitation and instead, the reward function should be provided to the learning algorithm. The advantage is that the algorithm can then use the reward function to check the reward for states that the agent hasn't even encountered yet. In addition, the algorithm can simultaneously learn policies for multiple reward functions. For each state, the algorithm would calculate the reward using each of the reward functions and add the rewards to its experience replay dataset. The Hindsight Experience Replay algorithm developed by Andrychowicz et al. (2017) does just this, and learns to generalize across a distribution of sparse, goal-based rewards. We extend this algorithm to linearly-weighted, multi-objective rewards and learn a single policy that can generalize across all linear combinations of the multi-objective reward. Whereas other multi-objective algorithms teach the Q-function to generalize across the reward weights, our algorithm enables the policy to generalize, and can thus be used with continuous actions.
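
The relabeling step described above can be illustrated with a minimal sketch. It assumes each transition carries a vector of per-objective rewards and that the scalar reward for a weight vector w is the dot product of w with that vector. The function name `relabel_transition`, the tuple layout, and the choice of weight vectors are hypothetical and are not taken from the paper.

```python
import numpy as np

def relabel_transition(state, action, next_state, reward_vector, weight_vectors):
    """Create one replay entry per weight vector (hypothetical helper).

    Because the reward function is known to the learner, the same
    transition can be scored under many linear weightings, not only
    the one that was active when the transition was collected.
    """
    entries = []
    for w in weight_vectors:
        # Linearly-weighted scalar reward: w . r
        scalar_reward = float(np.dot(w, reward_vector))
        # The weights are stored with the transition so that a policy
        # conditioned on them can be trained (an assumption of this sketch).
        entries.append((state, action, next_state, np.asarray(w), scalar_reward))
    return entries

# Usage: one transition with two objectives, scored under three weightings.
replay_buffer = []
weights = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
replay_buffer.extend(
    relabel_transition(
        state=np.zeros(3),
        action=np.array([0.1]),
        next_state=np.ones(3),
        reward_vector=np.array([1.0, -2.0]),
        weight_vectors=weights,
    )
)
```

Each stored entry pairs the transition with one weight vector and the scalar reward it implies, so a single environment step yields several training examples.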
