Latent Exploration for Reinforcement Learning

Abstract

In Reinforcement Learning, agents learn policies by exploring and interacting with the environment. Due to the curse of dimensionality, learning policies that map high-dimensional sensory input to motor output is particularly challenging. During training, state of the art methods (SAC, PPO, etc.) explore the environment by perturbing the actuation with independent Gaussian noise. While this unstructured exploration has proven successful in numerous tasks, it ought to be suboptimal for overactuated systems. When multiple actuators, such as motors or muscles, drive behavior, uncorrelated perturbations risk diminishing each other's effect, or modifying the behavior in a task-irrelevant way. While solutions to introduce time correlation across action perturbations exist, introducing correlation across actuators has been largely ignored. Here, we propose LATent TIme-Correlated Exploration (Lattice), a method to inject temporally-correlated noise into the latent state of the policy network, which can be seamlessly integrated with on- and off-policy algorithms. We demonstrate that the noisy actions generated by perturbing the network's activations can be modeled as a multivariate Gaussian distribution with a full covariance matrix. In the PyBullet locomotion tasks, Lattice-SAC achieves state of the art results, and reaches 18% higher reward than unstructured exploration in the Humanoid environment. In the musculoskeletal control environments of MyoSuite, Lattice-PPO achieves higher reward in most reaching and object manipulation tasks, while also finding more energy-efficient policies with reductions of 20-60%. Overall, we demonstrate the effectiveness of structured action noise in time and actuator space for complex motor control tasks.
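
To illustrate why perturbing the latent state yields correlated action noise, here is a minimal NumPy sketch. It rests on simplifying assumptions that are not taken from the abstract: the final policy layer is linear (a = W x), additive Gaussian noise is applied to both the latent state and the action, and time correlation is obtained by redrawing the noise only every `hold` steps. The names `W`, `latent_std`, `action_std`, and `hold` are hypothetical, and the paper's actual parameterization may differ.

```python
import numpy as np

def action_covariance(W, latent_std, action_std):
    """Covariance of a = W (x + eps_x) + eps_a for independent Gaussian eps_x, eps_a."""
    # The W diag(s_x^2) W^T term is what produces off-diagonal (cross-actuator) entries.
    return W @ np.diag(latent_std**2) @ W.T + np.diag(action_std**2)

def sample_noisy_actions(W, x, latent_std, action_std, n_steps, hold, rng):
    """Sample actions for a fixed latent x, redrawing the noise every `hold` steps."""
    actions = np.empty((n_steps, W.shape[0]))
    for t in range(n_steps):
        if t % hold == 0:  # keep the same perturbation within a block (time correlation)
            eps_x = rng.normal(0.0, latent_std)
            eps_a = rng.normal(0.0, action_std)
        actions[t] = W @ (x + eps_x) + eps_a
    return actions

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))      # assumed final layer: 4 latent units -> 3 actuators
x = rng.normal(size=4)           # assumed latent state for one observation
latent_std = np.full(4, 0.5)
action_std = np.full(3, 0.1)

cov = action_covariance(W, latent_std, action_std)
samples = sample_noisy_actions(W, x, latent_std, action_std,
                               n_steps=50_000, hold=1, rng=rng)
empirical = np.cov(samples, rowvar=False)  # should approximate `cov`
```

With `latent_std` set to zero the covariance reduces to the diagonal case of independent per-actuator noise, which corresponds to the unstructured exploration the abstract contrasts against. Setting `hold` greater than 1 repeats the same perturbation over consecutive steps.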
