Abstract
Unsupervised reinforcement learning (RL) aims to pre-train agents by exploring states or skills in reward-free environments, facilitating adaptation to downstream tasks. However, existing methods often overlook the fitting ability of pre-trained policies and struggle to handle heterogeneous pre-training data, both of which are crucial for efficient exploration and fast fine-tuning. To address this gap, we propose Exploratory Diffusion Policy (EDP), which leverages the strong expressive ability of diffusion models to fit the explored data, both boosting exploration and providing an efficient initialization for downstream tasks. Specifically, we estimate the distribution of the data collected in the replay buffer with the diffusion policy and propose a score intrinsic reward that encourages the agent to explore unseen states. For fine-tuning the pre-trained diffusion policy on downstream tasks, we provide both theoretical analyses and practical algorithms, including a method that alternates between Q function optimization and diffusion policy distillation. Extensive experiments demonstrate the effectiveness of EDP for efficient exploration during pre-training and fast adaptation during fine-tuning.