Safer Deep RL with Shallow MCTS: A Case Study in Pommerman

Abstract

Safe reinforcement learning has many variants and is still an open research problem. Here, we focus on how to use action guidance by means of a non-expert demonstrator to avoid catastrophic events in a domain with sparse, delayed, and deceptive rewards: the recently proposed multi-agent benchmark of Pommerman. This domain is very challenging for reinforcement learning (RL); past work has shown that model-free RL algorithms fail to achieve significant learning. In this paper, we shed light on the reasons behind this failure by exemplifying and analyzing the high rate of catastrophic events (i.e., suicides) that happen under random exploration in this domain. While model-free random exploration is typically futile, we propose a new framework where even a non-expert simulated demonstrator, e.g., a planning algorithm such as Monte Carlo tree search with a small number of rollouts, can be integrated into asynchronous distributed deep reinforcement learning methods. Compared to vanilla deep RL algorithms, our proposed methods both learn faster and converge to better policies on a two-player mini version of the Pommerman game.
