Abstract
Bayesian reinforcement learning (BRL) is a method that merges principles from Bayesian statistics and reinforcement learning to make optimal decisions in uncertain environments. Similar to other model-based RL approaches, it involves two key components: (1) inferring the posterior distribution of the data-generating process (DGP) that models the true environment and (2) learning a policy using the learned posterior. We propose to model the dynamics of the unknown environment through deep generative models assuming Markov dependence. In the absence of likelihood functions for these models, we train them by learning a generalized predictive-sequential (or prequential) scoring rule (SR) posterior. We use sequential Monte Carlo (SMC) samplers to draw samples from this generalized Bayesian posterior distribution. To achieve scalability in the high-dimensional parameter space of the neural networks, we use gradient-based Markov chain Monte Carlo (MCMC) kernels within SMC. To justify the use of the prequential scoring rule posterior, we prove a Bernstein-von Mises-type theorem. For policy learning, we propose expected Thompson sampling (ETS), which learns the optimal policy by maximizing the expected value function with respect to the posterior distribution. This improves upon traditional Thompson sampling (TS) and its extensions, which use only one sample drawn from the posterior distribution. We study this improvement both theoretically and in simulation studies, assuming discrete action and state spaces. Finally, we successfully extend our setup to a challenging problem with a continuous action space, without theoretical guarantees.
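The contrast between TS and ETS described above can be illustrated with a minimal sketch on a toy tabular problem. Everything in it is an assumption made for illustration and is not taken from the paper: the three-state, two-action MDP, the known reward table `R`, the Dirichlet draws that stand in for SMC samples from the posterior, the brute-force enumeration of deterministic policies, and all function and variable names. TS optimizes the value function under a single posterior draw, while ETS maximizes the value averaged over all draws.

```python
import itertools
import numpy as np

def policy_value(P, R, policy, gamma=0.9):
    """Exact value of a deterministic policy: V = (I - gamma * P_pi)^{-1} r_pi."""
    n_states = P.shape[0]
    idx = np.arange(n_states)
    policy = np.asarray(policy)
    P_pi = P[idx, policy]   # (S, S) transition matrix under the policy
    r_pi = R[idx, policy]   # (S,) reward vector under the policy
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)

rng = np.random.default_rng(0)
n_states, n_actions, n_samples = 3, 2, 50

# Hypothetical known rewards, indexed by (state, action).
R = rng.uniform(size=(n_states, n_actions))

# Stand-in for posterior samples of the dynamics: each draw is a
# transition tensor of shape (S, A, S) whose rows sum to one.
samples = rng.dirichlet(np.ones(n_states), size=(n_samples, n_states, n_actions))

# All deterministic policies (feasible only for tiny problems).
all_policies = list(itertools.product(range(n_actions), repeat=n_states))

def start_value(P, policy):
    """Value of the policy at the start state 0 under dynamics P."""
    return policy_value(P, R, policy)[0]

def posterior_mean_value(policy):
    """Monte Carlo estimate of the posterior-expected start-state value."""
    return float(np.mean([start_value(P, policy) for P in samples]))

# Thompson sampling: optimize against a single posterior draw.
theta = samples[0]
pi_ts = max(all_policies, key=lambda pi: start_value(theta, pi))

# Expected Thompson sampling: optimize the value averaged over all draws.
pi_ets = max(all_policies, key=posterior_mean_value)

print("TS policy:", pi_ts, "posterior-mean value:", posterior_mean_value(pi_ts))
print("ETS policy:", pi_ets, "posterior-mean value:", posterior_mean_value(pi_ets))
```

Because both policies are chosen from the same finite set, the ETS policy's posterior-mean value in this sketch can never be lower than that of the TS policy. The enumeration is exponential in the number of states, so the sketch only conveys the objective; it is not a scalable procedure.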