Abstract
Existing actor-critic algorithms, which are popular for continuous control reinforcement learning (RL) tasks, suffer from poor sample efficiency due to the lack of a principled exploration mechanism. Motivated by the success of Thompson sampling for efficient exploration in RL, we propose a novel model-free RL algorithm, Langevin Soft Actor Critic (LSAC), which prioritizes enhancing critic learning through uncertainty estimation over policy optimization. LSAC employs three key innovations: approximate Thompson sampling through distributional Langevin Monte Carlo (LMC) based $Q$ updates, parallel tempering for exploring multiple modes of the posterior of the $Q$ function, and diffusion-synthesized state-action samples regularized with $Q$ action gradients. Our extensive experiments demonstrate that LSAC outperforms or matches the performance of mainstream model-free RL algorithms for continuous control tasks. Notably, LSAC marks the first successful application of LMC-based Thompson sampling in continuous control tasks with continuous action spaces.