Characterizing Policy Divergence for Personalized Meta-Reinforcement Learning

Abstract

Despite ample motivation from costly exploration and limited trajectory data, rapidly adapting to new environments with few-shot reinforcement learning (RL) can remain a challenging task, especially with respect to personalized settings. Here, we consider the problem of recommending optimal policies to a set of multiple entities each with potentially different characteristics, such that individual entities may parameterize distinct environments with unique transition dynamics. Inspired by existing literature in meta-learning, we extend previous work by focusing on the notion that certain environments are more similar to each other than others in personalized settings, and propose a model-free meta-learning algorithm that prioritizes past experiences by relevance during gradient-based adaptation. Our algorithm involves characterizing past policy divergence through methods in inverse reinforcement learning, and we illustrate how such metrics are able to effectively distinguish past policy parameters by the environment they were deployed in, leading to more effective fast adaptation during test time. To study personalization more effectively we introduce a navigation testbed to specifically incorporate environment diversity across training episodes, and demonstrate that our approach outperforms meta-learning alternatives with respect to few-shot reinforcement learning in personalized settings.
