Abstract
Bayesian reinforcement learning (BRL) is a method that merges principles from Bayesian statistics and reinforcement learning to make optimal decisions in uncertain environments. As a model-based RL method, it has two key components: (1) inferring the posterior distribution of the model for the data-generating process (DGP) and (2) policy learning using the learned posterior. We propose to model the dynamics of the unknown environment through deep generative models, assuming Markov dependence. In the absence of likelihood functions for these models, we train them by learning a generalized predictive-sequential (or prequential) scoring rule (SR) posterior. We use sequential Monte Carlo (SMC) samplers to draw samples from this generalized Bayesian posterior distribution. To achieve scalability in the high-dimensional parameter space of the neural networks, we use gradient-based Markov kernels within SMC. To justify the use of the prequential scoring rule posterior, we prove a Bernstein-von Mises-type theorem. For policy learning, we propose expected Thompson sampling (ETS) to learn the optimal policy by maximising the expected value function with respect to the posterior distribution. This improves upon traditional Thompson sampling (TS) and its extensions, which utilize only one sample drawn from the posterior distribution. This improvement is studied both theoretically and using simulation studies, assuming a discrete action space. Finally, we extend our setup to a challenging problem with a continuous action space, for which we provide no theoretical guarantees.
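The contrast between TS and ETS can be illustrated with a minimal sketch for a single state and a discrete action space. It assumes that posterior draws of the model have already been turned into per-action value estimates; the names `q_samples`, `thompson_action`, and `expected_thompson_action` are hypothetical and chosen only for illustration, and the Gaussian draws stand in for values that would in practice come from posterior samples of the generative model.

```python
import random
from statistics import fmean


def thompson_action(q_samples, rng):
    """TS: act greedily under ONE randomly chosen posterior sample."""
    q = rng.choice(q_samples)
    return max(range(len(q)), key=lambda a: q[a])


def expected_thompson_action(q_samples):
    """ETS (simplified): act greedily under the value averaged over ALL samples."""
    n_actions = len(q_samples[0])
    mean_q = [fmean(q[a] for q in q_samples) for a in range(n_actions)]
    return max(range(n_actions), key=lambda a: mean_q[a])


# Hypothetical posterior draws of action values: 500 samples, 2 actions.
# These numbers are illustrative stand-ins, not results from the paper.
rng = random.Random(0)
q_samples = [[rng.gauss(1.0, 1.0), rng.gauss(1.2, 1.0)] for _ in range(500)]

ets_action = expected_thompson_action(q_samples)   # fixed given the samples
ts_actions = [thompson_action(q_samples, rng) for _ in range(200)]  # varies per draw
```

In this sketch the ETS choice is deterministic once the posterior samples are fixed, whereas the TS choice changes with the single sample drawn, which is the source of the extra variability that averaging over the posterior is meant to reduce.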