Temporal-Logic-Based Reward Shaping for Continuing Reinforcement Learning Tasks

Abstract

In continuing tasks, average-reward reinforcement learning may be a more appropriate problem formulation than the more common discounted-reward formulation. As usual, learning an optimal policy in this setting typically requires a large amount of training experience. Reward shaping is a common approach for incorporating domain knowledge into reinforcement learning in order to speed up convergence to an optimal policy. However, to the best of our knowledge, the theoretical properties of reward shaping have thus far only been established in the discounted setting. This paper presents the first reward shaping framework for average-reward learning and proves that, under standard assumptions, the optimal policy under the original reward function can be recovered. In order to avoid the need for manual construction of the shaping function, we introduce a method for utilizing domain knowledge expressed as a temporal logic formula. The formula is automatically translated to a shaping function that provides additional reward throughout the learning process. We evaluate the proposed method on three continuing tasks. In all cases, shaping speeds up the average-reward learning rate without any reduction in the performance of the learned policy compared to relevant baselines.
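
As a rough illustration of why shaping need not change what is optimal in the average-reward setting, the sketch below adds an undiscounted potential difference Φ(s') − Φ(s) to each reward along a trajectory. This state-based form is an assumption made here for illustration; the abstract does not specify the paper's construction, which may differ. The environment (a random walk on a ring), the potential `phi`, and the helper names are all hypothetical. Because the shaping terms telescope, the average shaped reward differs from the average original reward by exactly (Φ(s_T) − Φ(s_0)) / T, which vanishes as T grows whenever Φ is bounded.

```python
import random

def shaped_rewards(states, rewards, phi):
    """Add the undiscounted potential difference phi(s') - phi(s) to each reward."""
    return [r + phi(s_next) - phi(s)
            for r, s, s_next in zip(rewards, states[:-1], states[1:])]

def average(xs):
    return sum(xs) / len(xs)

# Hypothetical continuing task: a random walk on a ring of 5 states,
# with reward 1 whenever the walk lands on state 0.
rng = random.Random(0)
n_states, horizon = 5, 10_000
states = [0]
for _ in range(horizon):
    states.append((states[-1] + rng.choice([-1, 1])) % n_states)
rewards = [1.0 if s_next == 0 else 0.0 for s_next in states[1:]]

def phi(s):
    # Hypothetical potential, bounded by n_states - 1.
    return float(s)

shaped = shaped_rewards(states, rewards, phi)

# The shaping terms telescope, so this gap equals (phi(s_T) - phi(s_0)) / T.
gap = average(shaped) - average(rewards)
print(f"average-reward gap after {horizon} steps: {gap:.6f}")
```

The per-step rewards can differ substantially, which is what lets shaping guide learning, while the long-run average that defines the objective is left essentially untouched.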
