Abstract
This paper presents Memory Augmented Policy Optimization (MAPO): a novel policy optimization formulation that incorporates a memory buffer of promising trajectories to reduce the variance of policy gradient estimates for deterministic environments with discrete actions. The formulation expresses the expected return objective as a weighted sum of two terms: an expectation over a memory of trajectories with high rewards, and a separate expectation over the trajectories outside the memory. We propose 3 techniques to make an efficient training algorithm for MAPO: (1) distributed sampling from inside and outside memory with an actor-learner architecture; (2) a marginal likelihood constraint over the memory to accelerate training; (3) systematic exploration to discover high reward trajectories. MAPO improves the sample efficiency and robustness of policy gradient, especially on tasks with a sparse reward. We evaluate MAPO on weakly supervised program synthesis from natural language with an emphasis on generalization. On the WikiTableQuestions benchmark we improve the state-of-the-art by 2.5%, achieving an accuracy of 46.2%, and on the WikiSQL benchmark, MAPO achieves an accuracy of 74.9% with only weak supervision, outperforming several strong baselines with full supervision. Our code is open sourced at https://github.com/crazydonkey200/neural-symbolic-machines.
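As a rough illustration of the weighted-sum formulation, the sketch below checks numerically that, on a toy trajectory space small enough to enumerate, the expected return equals a weighted sum of an expectation inside the memory and an expectation outside it. It assumes the weight is the total probability the policy assigns to the memory; the function names and the toy numbers are hypothetical and are not taken from the MAPO codebase.

```python
import math

def expected_return(probs, rewards):
    """Plain expected return: sum over all trajectories of pi(a) * R(a)."""
    return sum(p * r for p, r in zip(probs, rewards))

def memory_decomposed_return(probs, rewards, in_memory):
    """Same quantity, written as a weighted sum of two expectations.

    Assumption (for illustration): the weight pi_b is the total probability
    mass that the policy places on trajectories stored in the memory.
    """
    pi_b = sum(p for p, m in zip(probs, in_memory) if m)

    # Expectation over the memory, under the policy renormalized to the memory.
    inside = 0.0
    if pi_b > 0:
        inside = sum((p / pi_b) * r
                     for p, r, m in zip(probs, rewards, in_memory) if m)

    # Expectation over everything outside the memory, renormalized likewise.
    outside = 0.0
    if pi_b < 1:
        outside = sum((p / (1 - pi_b)) * r
                      for p, r, m in zip(probs, rewards, in_memory) if not m)

    return pi_b * inside + (1 - pi_b) * outside

# Toy example: four trajectories, two of which earn reward and sit in memory.
toy_probs = [0.1, 0.2, 0.3, 0.4]
toy_rewards = [1.0, 0.0, 1.0, 0.0]
toy_in_memory = [True, False, True, False]

direct = expected_return(toy_probs, toy_rewards)
decomposed = memory_decomposed_return(toy_probs, toy_rewards, toy_in_memory)
print(math.isclose(direct, decomposed))
```

In this toy setting the inside-memory term is computed by exact enumeration, so only the outside-memory term would need to be estimated by sampling in practice, which is one way to read the variance reduction claimed above.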