Abstract
On-policy reinforcement learning (RL) algorithms are widely used for their strong asymptotic performance and training stability, but they struggle to scale with larger batch sizes, as additional parallel environments yield redundant data due to limited policy-induced diversity. In contrast, Evolutionary Algorithms (EAs) scale naturally and encourage exploration via randomized population-based search, but are often sample-inefficient. We propose Evolutionary Policy Optimization (EPO), a hybrid algorithm that combines the scalability and diversity of EAs with the performance and stability of policy gradients. EPO maintains a population of agents conditioned on latent variables, shares actor-critic network parameters for coherence and memory efficiency, and aggregates diverse experiences into a master agent. Across tasks in dexterous manipulation, legged locomotion, and classic control, EPO outperforms state-of-the-art baselines in sample efficiency, asymptotic performance, and scalability.
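As an informal illustration of latent-conditioned agents with shared parameters, the following minimal sketch uses a toy linear policy. It is not the EPO implementation: the names `shared_weights`, `latents`, `act`, and `pooled_buffer`, the dimensions, and the linear policy form are all assumptions made for illustration. The sketch shows only that a single set of weights can produce distinct behavior for each agent when the input is augmented with a per-agent latent vector, and that the resulting experience can be pooled for a single learner.

```python
import random

# Hypothetical sizes, chosen only for this illustration.
OBS_DIM, LATENT_DIM, NUM_AGENTS = 3, 2, 4

rng = random.Random(0)

# One set of weights shared by every agent (assumed toy linear policy).
shared_weights = [rng.uniform(-1.0, 1.0) for _ in range(OBS_DIM + LATENT_DIM)]

# Each agent in the population is distinguished only by its latent vector.
latents = [
    [rng.gauss(0.0, 1.0) for _ in range(LATENT_DIM)]
    for _ in range(NUM_AGENTS)
]


def act(obs, latent):
    """Return a scalar action from the shared policy, conditioned on a latent."""
    features = list(obs) + list(latent)
    return sum(w * x for w, x in zip(shared_weights, features))


obs = [0.5, -0.2, 0.1]

# Pool (agent_id, observation, action) tuples from all agents into one buffer,
# standing in for the experience that a single learner could consume.
pooled_buffer = [(agent_id, obs, act(obs, z)) for agent_id, z in enumerate(latents)]

actions = [action for _, _, action in pooled_buffer]
```

Because the agents differ only in their latent inputs, memory cost in this toy setting grows with the number of latent vectors rather than with the number of full networks.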