Abstract
Recent technological progress in the development of Unmanned Aerial Vehicles (UAVs), together with decreasing acquisition costs, makes the application of drone fleets attractive for a wide variety of tasks. In agriculture, disaster management, search and rescue operations, and commercial and military applications, the advantage of applying a fleet of drones originates from their ability to cooperate autonomously. Multi-Agent Reinforcement Learning approaches that aim to optimize a neural network based control policy, such as the best performing actor-critic policy gradient algorithms, struggle to effectively back-propagate errors from distinct reward signal sources and tend to favor lucrative signals while neglecting coordination and the exploitation of previously learned similarities. We propose a Multi-Critic Policy Optimization architecture with multiple value estimating networks and a novel advantage function that optimizes a stochastic actor policy network to achieve optimal coordination of agents. We then apply the algorithm to several tasks that require the collaboration of multiple drones in a physics-based reinforcement learning environment. Our approach achieves a stable policy network update and similarity in reward signal development for an increasing number of agents. The resulting policy achieves optimal coordination and compliance with constraints such as collision avoidance.