Fast Value Tracking for Deep Reinforcement Learning

Abstract

Reinforcement learning (RL) tackles sequential decision-making problems by creating agents that interact with their environment. However, existing algorithms often view these problems as static, focusing on point estimates for model parameters to maximize expected rewards, neglecting the stochastic dynamics of agent-environment interactions and the critical role of uncertainty quantification. Our research leverages the Kalman filtering paradigm to introduce a novel and scalable sampling algorithm called Langevinized Kalman Temporal-Difference (LKTD) for deep reinforcement learning. This algorithm, grounded in Stochastic Gradient Markov Chain Monte Carlo (SGMCMC), efficiently draws samples from the posterior distribution of deep neural network parameters. Under mild conditions, we prove that the posterior samples generated by the LKTD algorithm converge to a stationary distribution. This convergence not only enables us to quantify uncertainties associated with the value function and model parameters but also allows us to monitor these uncertainties during policy updates throughout the training phase. The LKTD algorithm paves the way for more robust and adaptable reinforcement learning approaches.
