EVaDE: Event-Based Variational Thompson Sampling for Model-Based Reinforcement Learning

Abstract

Posterior Sampling for Reinforcement Learning (PSRL) is a well-known algorithm that augments model-based reinforcement learning (MBRL) algorithms with Thompson sampling. PSRL maintains posterior distributions over the environment transition dynamics and the reward function; these posteriors are intractable for tasks with high-dimensional state and action spaces. Recent works show that dropout, used in conjunction with neural networks, induces variational distributions that can approximate these posteriors. In this paper, we propose Event-based Variational Distributions for Exploration (EVaDE), which are variational distributions that are useful for MBRL, especially when the underlying domain is object-based. We leverage the general domain knowledge of object-based domains to design three types of event-based convolutional layers to direct exploration. These layers rely on Gaussian dropout and are inserted between the layers of the deep neural network model to facilitate variational Thompson sampling. We empirically show the effectiveness of EVaDE-equipped Simulated Policy Learning (EVaDE-SimPLe) on the 100K Atari game suite.
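The central mechanism here is that a dropout layer whose noise is drawn once and then held fixed turns a single network into one sample from a distribution over models, which is what Thompson sampling needs. The sketch below illustrates only that generic Gaussian-dropout mechanism; it does not reproduce the three event-based layers proposed in this paper. It assumes NumPy, a toy two-layer model built from 1x1 convolutions, and per-channel multiplicative noise eps ~ N(1, alpha). The names `GaussianDropout`, `TinyConvModel`, `resample` and `alpha` are hypothetical.

```python
import numpy as np


class GaussianDropout:
    """Hypothetical per-channel Gaussian dropout layer.

    Multiplies each channel by a factor eps ~ N(1, alpha). The factors are
    drawn by resample() and then held fixed, so every forward pass between two
    resample() calls uses the same sampled network.
    """

    def __init__(self, num_channels, alpha=0.1, seed=None):
        self.num_channels = num_channels
        self.alpha = alpha
        self.rng = np.random.default_rng(seed)
        self.mask = np.ones(num_channels)  # identity until the first resample()

    def resample(self):
        # One draw of the noise = one sample from the variational distribution.
        self.mask = self.rng.normal(1.0, np.sqrt(self.alpha), size=self.num_channels)

    def __call__(self, x):
        # x has shape (channels, height, width); scale each channel map.
        return x * self.mask[:, None, None]


class TinyConvModel:
    """Toy (hypothetical) model: two 1x1 convolutions with dropout in between."""

    def __init__(self, c_in, c_hidden, c_out, alpha=0.1, seed=0):
        rng = np.random.default_rng(seed)
        self.w1 = rng.normal(size=(c_hidden, c_in))
        self.w2 = rng.normal(size=(c_out, c_hidden))
        self.dropout = GaussianDropout(c_hidden, alpha=alpha, seed=seed + 1)

    def __call__(self, x):
        # A 1x1 convolution is a channel-mixing matrix applied at every pixel.
        h = np.maximum(np.einsum("oc,chw->ohw", self.w1, x), 0.0)
        h = self.dropout(h)  # noise inserted between the two layers
        return np.einsum("oc,chw->ohw", self.w2, h)


if __name__ == "__main__":
    model = TinyConvModel(c_in=3, c_hidden=8, c_out=3)
    frame = np.ones((3, 4, 4))
    for episode in range(2):
        model.dropout.resample()  # Thompson step: commit to one sampled model
        a = model(frame)
        b = model(frame)          # same sampled model for the whole episode
        print(episode, np.allclose(a, b))
```

Resampling once per episode and keeping the mask fixed while acting mirrors the PSRL pattern of committing to a single posterior sample for an episode; resampling at every forward pass would instead inject fresh, independent noise at each step, which is a different behaviour.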
