Abstract
Reinforcement Learning (RL) offers a fundamental framework for discovering optimal action strategies through interactions within unknown environments. Recent advances have shown that the performance and applicability of RL can be significantly enhanced by exploiting a population of agents in various ways. Zeroth-Order Optimization (ZOO) leverages an agent population to estimate the gradient of the objective function, enabling robust policy refinement even in non-differentiable scenarios. As another application, Genetic Algorithms (GA) boost the exploration of policy landscapes by generating policy diversity in an agent population through mutation and refining it through selection. A natural question is whether we can have the best of both worlds in how an agent population is used. In this work, we propose Ancestral Reinforcement Learning (ARL), which synergistically combines the robust gradient estimation of ZOO with the exploratory power of GA. The key idea in ARL is that each agent within a population infers the gradient by exploiting the history of its ancestors, i.e., the ancestor population in the past, while maintaining the diversity of policies in the current population as in GA. We also theoretically reveal that the populational search in ARL implicitly induces KL-regularization of the objective function, resulting in enhanced exploration. Our results extend the applicability of populational algorithms for RL.
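
As a rough illustration of the population-based gradient estimation that ZOO refers to, the sketch below uses a standard antithetic Gaussian-smoothing estimator on a toy objective. It is not the ARL algorithm, and it does not model ancestry, mutation, or selection. The names `zo_gradient` and `toy_return`, the parameters `sigma` and `population`, and the quadratic objective are all assumptions made for this example.

```python
import random


def zo_gradient(objective, theta, sigma=0.1, population=2000, seed=0):
    """Estimate the gradient of `objective` at `theta` from function values only.

    Each population member evaluates the objective at a randomly perturbed
    parameter vector (and at its mirrored perturbation); no derivatives of
    `objective` are required.
    """
    rng = random.Random(seed)
    dim = len(theta)
    grad = [0.0] * dim
    for _ in range(population):
        # One Gaussian perturbation direction per population member.
        eps = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        plus = objective([t + sigma * e for t, e in zip(theta, eps)])
        minus = objective([t - sigma * e for t, e in zip(theta, eps)])
        # Antithetic (mirrored) difference, averaged over the population.
        scale = (plus - minus) / (2.0 * sigma * population)
        for i in range(dim):
            grad[i] += scale * eps[i]
    return grad


def toy_return(theta):
    # Hypothetical stand-in for an expected return; maximized at theta = (1, -2).
    return -((theta[0] - 1.0) ** 2 + (theta[1] + 2.0) ** 2)


# The analytic gradient of toy_return at the origin is (2, -4);
# the estimate should land near it, up to sampling noise.
estimate = zo_gradient(toy_return, [0.0, 0.0])
print(estimate)
```

A gradient-ascent step on `theta` along `estimate` would then play the role of the policy refinement; the estimator's noise shrinks as `population` grows.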