Learning a subspace of policies for online adaptation in Reinforcement Learning

Abstract

Deep Reinforcement Learning (RL) is mainly studied in a setting where the training and the testing environments are similar. But in many practical applications, these environments may differ. For instance, in control systems, the robot(s) on which a policy is learned might differ from the robot(s) on which the policy will run. This mismatch can be caused by internal factors (e.g., calibration issues, system attrition, defective modules) or by external changes (e.g., weather conditions). There is a need to develop RL methods that generalize well to variations of the training conditions. In this article, we consider the simplest yet hard-to-tackle generalization setting, where the test environment is unknown at train time, forcing the agent to adapt to the system's new dynamics. This online adaptation process can be computationally expensive (e.g., fine-tuning) and cannot rely on meta-RL techniques since there is just a single train environment. To address this, we propose an approach where we learn a subspace of policies within the parameter space. This subspace contains an infinite number of policies that are trained to solve the training environment while having different parameter values. As a consequence, two policies in that subspace process information differently and exhibit different behaviors when facing variations of the train environment. Our experiments, carried out over a large variety of benchmarks, compare our approach with baselines, including diversity-based methods. In comparison, our approach is simple to tune, does not need any extra component (e.g., a discriminator), and learns policies able to gather a high reward on unseen environments.
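
To make the idea of a subspace of policies more concrete, here is a minimal sketch in Python. The abstract does not state how the subspace is parametrized or how a policy is chosen at test time, so everything below is an assumption made for illustration: the subspace is taken to be the convex hull of a few anchor parameter vectors, and adaptation is done by trying sampled points and keeping the one with the highest return. The names `combine`, `sample_weights`, `adapt_online`, and `episode_return` are hypothetical and do not come from the paper.

```python
import random
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


def combine(anchors: Sequence[Vector], weights: Sequence[float]) -> Vector:
    """Return the convex combination sum_i weights[i] * anchors[i]."""
    if len(anchors) != len(weights):
        raise ValueError("need one weight per anchor")
    if any(w < 0 for w in weights) or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must be non-negative and sum to 1")
    dim = len(anchors[0])
    return [sum(w * a[j] for w, a in zip(weights, anchors)) for j in range(dim)]


def sample_weights(n: int, rng: random.Random) -> List[float]:
    """Sample a point uniformly from the simplex by normalizing exponential draws."""
    draws = [rng.expovariate(1.0) for _ in range(n)]
    total = sum(draws)
    return [d / total for d in draws]


def adapt_online(
    anchors: Sequence[Vector],
    episode_return: Callable[[Vector], float],
    n_candidates: int,
    rng: random.Random,
) -> Tuple[List[float], float]:
    """Evaluate sampled policies from the subspace and keep the best one.

    `episode_return` is a hypothetical stand-in for running a policy with the
    given parameters on the unseen test environment and measuring its return.
    """
    best_weights: List[float] = []
    best_return = float("-inf")
    for _ in range(n_candidates):
        weights = sample_weights(len(anchors), rng)
        ret = episode_return(combine(anchors, weights))
        if ret > best_return:
            best_weights, best_return = weights, ret
    return best_weights, best_return


# Toy usage: two anchors and a made-up return function (assumption, not a real environment).
rng = random.Random(0)
anchors = [[0.0, 0.0], [1.0, 1.0]]
toy_return = lambda params: -((params[0] - 0.3) ** 2)
weights, ret = adapt_online(anchors, toy_return, n_candidates=20, rng=rng)
```

Under these assumptions, adaptation only searches over the mixing weights and requires no gradient updates, which is one way to read the abstract's contrast with costly fine-tuning.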
