Batch size-invariance for policy optimization

Abstract

We say an algorithm is batch size-invariant if changes to the batch size can largely be compensated for by changes to other hyperparameters. Stochastic gradient descent is well-known to have this property at small batch sizes, via the learning rate. However, some policy optimization algorithms (such as PPO) do not have this property, because of how they control the size of policy updates. In this work we show how to make these algorithms batch size-invariant. Our key insight is to decouple the proximal policy (used for controlling policy updates) from the behavior policy (used for off-policy corrections). Our experiments help explain why these algorithms work, and additionally show how they can make more efficient use of stale data.
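
To make the decoupling idea concrete, the following is a minimal sketch of one way a PPO-style clipped objective could be written with separate proximal and behavior policies. The abstract does not give the exact objective, so the form used here is an assumption: the clipped ratio is taken against the proximal policy, and a separate importance weight against the behavior policy corrects for off-policy data. The function name, argument names, and the default `clip_eps` are all hypothetical.

```python
import math


def decoupled_clipped_objective(logp_new, logp_prox, logp_behav,
                                advantages, clip_eps=0.2):
    """Illustrative decoupled clipped surrogate (an assumed form, not
    necessarily the exact objective used in the paper).

    logp_new   : log-probabilities of the taken actions under the policy
                 being optimized
    logp_prox  : log-probabilities under the proximal policy
                 (controls the size of the policy update)
    logp_behav : log-probabilities under the behavior policy
                 (the policy that collected the data)
    advantages : advantage estimates for the taken actions
    """
    total = 0.0
    for ln, lp, lb, adv in zip(logp_new, logp_prox, logp_behav, advantages):
        # Ratio against the proximal policy: this is the quantity clipped.
        ratio = math.exp(ln - lp)
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        surrogate = min(ratio * adv, clipped * adv)
        # Importance weight against the behavior policy: off-policy correction.
        weight = math.exp(lp - lb)
        total += weight * surrogate
    return total / len(advantages)


# When the proximal and behavior policies coincide, the importance weight
# is 1 and the expression reduces to the usual clipped surrogate.
logp_new = [math.log(0.5), math.log(0.2)]
logp_old = [math.log(0.25), math.log(0.4)]
advantages = [1.0, -1.0]
coupled = decoupled_clipped_objective(logp_new, logp_old, logp_old, advantages)

# With a different behavior policy, only the importance weight changes.
logp_behav = [math.log(0.5), math.log(0.4)]
decoupled = decoupled_clipped_objective(logp_new, logp_old, logp_behav, advantages)
```

In this sketch the clipping range depends only on the proximal policy, so the data could in principle come from an older behavior policy without changing how far each update is allowed to move.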
