Accelerating Training in Pommerman with Imitation and Reinforcement Learning

Abstract

The Pommerman simulation was recently developed to mimic the classic Japanese game Bomberman, and focuses on competitive gameplay in a multi-agent setting. We focus on the 2$\times$2 team version of Pommerman, developed for a competition at NeurIPS 2018. Our methodology involves training an agent initially through imitation learning on a noisy expert policy, followed by a proximal policy optimization (PPO) reinforcement learning algorithm. The basic PPO approach is modified for a stable transition from the imitation learning phase through reward shaping, action filters based on heuristics, and curriculum learning. The proposed methodology is able to beat heuristic and pure reinforcement learning baselines with a combined 100,000 training games, significantly faster than other non-tree-search methods in the literature. We present results against multiple agents provided by the developers of the simulation, including some that we have enhanced. We include a sensitivity analysis over different parameters, and highlight undesirable effects of some strategies that initially appear promising. Since Pommerman is a complex multi-agent competitive environment, the strategies developed here provide insights into several real-world problems with characteristics such as partial observability, decentralized execution (without communication), and very sparse and delayed rewards.
