Abstract
Popular reinforcement learning (RL) algorithms tend to produce a unimodal policy distribution, which limits the expressiveness of complicated policies and weakens exploration. Diffusion probability models are powerful at learning complicated multimodal distributions and have shown promising applications to RL. In this paper, we formally build a theoretical foundation for policy representation via the diffusion probability model and provide practical implementations of diffusion policy for online model-free RL. Concretely, we characterize diffusion policy as a stochastic process, which is a new approach to representing a policy. We then present a convergence guarantee for diffusion policy, which provides a theory for understanding the multimodality of diffusion policy. Furthermore, we propose DIPO, an implementation of model-free online RL with DIffusion POlicy. To the best of our knowledge, DIPO is the first algorithm to solve model-free online RL problems with the diffusion model. Finally, extensive empirical results show the effectiveness and superiority of DIPO on the standard continuous control MuJoCo benchmark.