Abstract
Potential-based reward shaping is commonly used to incorporate prior knowledge of how to solve the task into reinforcement learning because it can formally guarantee policy invariance. As such, the optimal policy and the ordering of policies by their returns are not altered by potential-based reward shaping. In this work, we highlight the dependence of effective potential-based reward shaping on the initial Q-values and external rewards, which determine the agent's ability to exploit the shaping rewards to guide its exploration and achieve increased sample efficiency. We formally derive how a simple linear shift of the potential function can be used to improve the effectiveness of reward shaping without changing the encoded preferences in the potential function, and without having to adjust the initial Q-values, which can be challenging and undesirable in deep reinforcement learning. We show the theoretical limitations of continuous potential functions for correctly assigning positive and negative reward shaping values. We verify our theoretical findings empirically on Gridworld domains with sparse and uninformative reward functions, as well as on the Cart Pole and Mountain Car environments, where we demonstrate the application of our results in deep reinforcement learning.
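As a minimal illustration of why a shift of the potential matters, the sketch below uses the standard potential-based shaping term F(s, s') = γΦ(s') − Φ(s) and assumes that the shift is a constant offset b added to Φ. The toy potential, the state indices, and the values of γ and b are hypothetical and chosen only for illustration; they are not taken from the experiments. Under this assumption, every shaping reward changes by (γ − 1)b, while the ordering of states by potential is unchanged.

```python
def shaping_reward(phi, s, s_next, gamma):
    """Standard potential-based shaping term F(s, s') = gamma * phi(s') - phi(s)."""
    return gamma * phi(s_next) - phi(s)


def shifted(phi, b):
    """Return the potential phi(s) + b, where b is a hypothetical constant offset."""
    return lambda s: phi(s) + b


gamma = 0.9                      # discount factor (illustrative value)
b = 10.0                         # constant shift (illustrative value)
phi = lambda s: -abs(10 - s)     # toy potential: negative distance to a goal at state 10
phi_b = shifted(phi, b)

# Shaping reward for a step from state 3 to state 4, which moves toward the goal.
f_plain = shaping_reward(phi, 3, 4, gamma)
f_shift = shaping_reward(phi_b, 3, 4, gamma)

# The shift changes the shaping reward by exactly (gamma - 1) * b.
delta = f_shift - f_plain
print(f_plain, f_shift, delta)
```

Because the offset is the same for every state, the preference encoded by Φ (state 4 is better than state 3) is preserved, while the magnitude and possibly the sign of the shaping reward relative to the initial Q-values change.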