Abstract
Unsupervised reinforcement learning (URL) aims to pre-train agents by exploring diverse states or skills in reward-free environments, facilitating efficient adaptation to downstream tasks. As the agent cannot access extrinsic rewards during unsupervised exploration, existing methods design intrinsic rewards that model the explored data and encourage further exploration. However, the explored data are always heterogeneous, which requires strong representation abilities from both the intrinsic reward models and the pre-trained policies. In this work, we propose the Exploratory Diffusion Model (ExDM), which leverages the strong expressive ability of diffusion models to fit the explored data, simultaneously boosting exploration and providing an efficient initialization for downstream tasks. Specifically, ExDM uses a diffusion model to accurately estimate the distribution of the data collected in the replay buffer and introduces a score-based intrinsic reward that encourages the agent to explore less-visited states. After obtaining the pre-trained policies, ExDM enables rapid adaptation to downstream tasks. In particular, we provide theoretical analyses and practical algorithms for fine-tuning diffusion policies, addressing key challenges such as training instability and the computational complexity caused by multi-step sampling. Extensive experiments demonstrate that ExDM outperforms existing SOTA baselines in both efficient unsupervised exploration and fast fine-tuning on downstream tasks, especially in structurally complicated environments.