Structured Evolution with Compact Architectures for Scalable Policy Optimization

Abstract

We present a new method of blackbox optimization via gradient approximationwith the use of structured random orthogonal matrices, providing more accurateestimators than baselines and with provable theoretical guarantees. We showthat this algorithm can be successfully applied to learn better quality compactpolicies than those using standard gradient estimation techniques. The compactpolicies we learn have several advantages over unstructured ones, includingfaster training algorithms and faster inference. These benefits are importantwhen the policy is deployed on real hardware with limited resources. Further,compact policies provide more scalable architectures for derivative-freeoptimization (DFO) in high-dimensional spaces. We show that most robotics tasksfrom the OpenAI Gym can be solved using neural networks with less than 300parameters, with almost linear time complexity of the inference phase, with upto 13x fewer parameters relative to the Evolution Strategies (ES) algorithmintroduced by Salimans et al. (2017). We do not need heuristics such as fitnessshaping to learn good quality policies, resulting in a simple and theoreticallymotivated training mechanism.

Quick Read (beta)

loading the full paper ...