Developing cooperative policies for multi-stage reinforcement learning tasks

Abstract

Many hierarchical reinforcement learning algorithms utilise a series of independent skills as a basis to solve tasks at a higher level of reasoning. These algorithms don't consider the value of using skills that are cooperative instead of independent. This paper proposes the Cooperative Consecutive Policies (CCP) method of enabling consecutive agents to cooperatively solve long time horizon multi-stage tasks. This method is achieved by modifying the policy of each agent to maximise both the current and next agent's critic. Cooperatively maximising critics allows each agent to take actions that are beneficial for its task as well as subsequent tasks. Using this method in a multi-room maze domain and a peg in hole manipulation domain, the cooperative policies were able to outperform a set of naive policies, a single agent trained across the entire domain, as well as another sequential HRL algorithm.
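
The idea of a policy that maximises both its own critic and the next agent's critic can be illustrated with a small toy sketch. The abstract does not specify how the two critics are combined, so everything below is an assumption made for illustration: the mixing weight `eta`, the convex-combination form of the objective, the choice to evaluate both critics on the same state and action, and the two hand-written critics `q_current_stage` and `q_next_stage`. It should not be read as the paper's implementation.

```python
from typing import Callable, Sequence

# A critic maps (state, action) to a value estimate.
Critic = Callable[[float, float], float]


def cooperative_objective(q_current: Critic, q_next: Critic,
                          state: float, action: float,
                          eta: float = 0.5) -> float:
    """Weighted sum of the current and next agent's critic estimates.

    `eta` is a hypothetical mixing weight (not taken from the abstract):
    eta = 1.0 ignores the next agent, which corresponds to a naive policy.
    """
    return eta * q_current(state, action) + (1.0 - eta) * q_next(state, action)


def greedy_action(q_current: Critic, q_next: Critic, state: float,
                  candidates: Sequence[float], eta: float = 0.5) -> float:
    """Pick the candidate action with the highest cooperative objective."""
    return max(
        candidates,
        key=lambda a: cooperative_objective(q_current, q_next, state, a, eta),
    )


# Toy critics (assumed for illustration): each stage prefers a different action.
def q_current_stage(state: float, action: float) -> float:
    return -(action - 1.0) ** 2  # current agent's critic peaks at action 1.0


def q_next_stage(state: float, action: float) -> float:
    return -(action - 3.0) ** 2  # next agent's critic peaks at action 3.0


candidates = [0.0, 1.0, 2.0, 3.0, 4.0]
naive = greedy_action(q_current_stage, q_next_stage, 0.0, candidates, eta=1.0)
cooperative = greedy_action(q_current_stage, q_next_stage, 0.0, candidates, eta=0.5)
print(naive, cooperative)
```

In this toy setting the naive choice is the action that is best for the current stage alone, while the cooperative choice settles on a compromise that leaves the next stage better placed. In a learned setting the same weighted objective would presumably appear in the actor's loss rather than in a discrete search over candidate actions.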
