Abstract
In recent years, by leveraging more data, computation, and diverse tasks, learned optimizers have achieved remarkable success in supervised learning, outperforming classical hand-designed optimizers. Reinforcement learning (RL) is essentially different from supervised learning, and in practice, these learned optimizers do not work well even in simple RL tasks. We investigate this phenomenon and identify two issues. First, the agent-gradient distribution is non-independent and identically distributed, leading to inefficient meta-training. Moreover, due to highly stochastic agent-environment interactions, the agent-gradients have high bias and variance, which increases the difficulty of learning an optimizer for RL. We propose pipeline training and a novel optimizer structure with a good inductive bias to address these issues, making it possible to learn an optimizer for reinforcement learning from scratch. We show that, although only trained in toy tasks, our learned optimizer can generalize to unseen complex tasks in Brax.