Self Punishment and Reward Backfill for Deep Q-Learning

Abstract

Reinforcement learning agents learn by encouraging behaviours which maximize their total reward, usually provided by the environment. In many environments, however, the reward is provided after a series of actions rather than after each single action, leaving the agent uncertain about which of those actions were effective, an issue known as the credit assignment problem. In this paper, we propose two strategies inspired by behavioural psychology to enable the agent to intrinsically estimate more informative reward values for actions with no reward. The first strategy, called self-punishment (SP), discourages the agent from making mistakes that lead to undesirable terminal states. The second strategy, called reward backfill (RB), backpropagates the rewards between two rewarded actions. We prove that, under certain assumptions and regardless of the reinforcement learning algorithm used, these two strategies maintain the order of policies in the space of all possible policies in terms of their total reward and, by extension, maintain the optimal policy. Hence, our proposed strategies integrate with any reinforcement learning algorithm that learns a value or action-value function through experience. We incorporated these two strategies into three popular deep reinforcement learning approaches and evaluated the results on thirty Atari games. After parameter tuning, our results indicate that the proposed strategies improve the tested methods in over 65 percent of tested games, with performance gains of up to more than 25 times.
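
To make the two ideas concrete, the following is a minimal sketch of how reward shaping of this kind could be applied to the reward sequence of one finished episode. It is an illustration only, not the paper's formulation: the function names, the fixed `penalty`, the geometric `decay`, and the choice to leave already-rewarded steps untouched are all assumptions made here, since the abstract does not specify them.

```python
from typing import List


def self_punish(rewards: List[float], ended_badly: bool,
                penalty: float = -1.0) -> List[float]:
    """Assumed form of self-punishment: if the episode ended in an
    undesirable terminal state and the last step carried no reward,
    assign that step a negative reward (hypothetical `penalty`)."""
    shaped = list(rewards)
    if ended_badly and shaped and shaped[-1] == 0:
        shaped[-1] = penalty
    return shaped


def backfill(rewards: List[float], decay: float = 0.5) -> List[float]:
    """Assumed form of reward backfill: walk the episode backwards and
    give each zero-reward step a decayed copy of the next rewarded
    step's reward. Steps that already have a reward are left as-is."""
    shaped = list(rewards)
    carried = 0.0
    for t in range(len(shaped) - 1, -1, -1):
        if shaped[t] != 0:
            carried = shaped[t]      # a real reward resets the carry
        else:
            carried *= decay         # hypothetical geometric decay
            shaped[t] = carried
    return shaped


# Example: rewards arrive only at steps 2 and 5.
episode = [0.0, 0.0, 1.0, 0.0, 0.0, 2.0]
print(backfill(episode))  # [0.25, 0.5, 1.0, 0.5, 1.0, 2.0]
```

In a deep Q-learning setup, shaped rewards like these would replace the raw rewards of the episode's transitions before they are stored in the replay buffer; where exactly the paper applies them is not stated in the abstract.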
