Accelerating Training in Pommerman with Imitation and Reinforcement Learning

Abstract

The Pommerman simulation was recently developed to mimic the classic Japanese game Bomberman, and focuses on competitive gameplay in a multi-agent setting. We focus on the 2$\times$2 team version of Pommerman, developed for a competition at NeurIPS 2018\footnote{https://nips.cc/Conferences/2018/CompetitionTrack}. Our methodology involves training an agent initially through imitation learning on a noisy expert policy, followed by a proximal policy optimization (PPO) reinforcement learning algorithm. The basic PPO approach is modified for stable transition from the imitation learning phase through reward shaping, action filters based on heuristics, and curriculum learning. The proposed methodology is able to beat heuristic and pure reinforcement learning baselines with a combined 100,000 training games, significantly faster than other non-tree-search methods in the literature. We present results against multiple agents provided by the developers of the simulation, including some that we have enhanced. We include a sensitivity analysis over different parameters, and highlight undesirable effects of some strategies that initially appear promising. Since Pommerman is a complex multi-agent competitive environment, the strategies developed here provide insights into several real-world problems with characteristics such as partial observability, decentralized execution (without communication), and very sparse and delayed rewards.
