Abstract
We study offline meta-reinforcement learning, a practical reinforcement learning paradigm that learns from offline data to adapt to new tasks. The distribution of offline data is determined jointly by the behavior policy and the task. Existing offline meta-reinforcement learning algorithms cannot distinguish these two factors, which makes task representations unstable under changes in the behavior policy. To address this problem, we propose a contrastive learning framework for task representations that are robust to the distribution mismatch of behavior policies between training and testing. We design a bi-level encoder structure, use mutual information maximization to formalize task representation learning, derive a contrastive learning objective, and introduce several approaches to approximate the true distribution of negative pairs. Experiments on a variety of offline meta-reinforcement learning benchmarks demonstrate the advantages of our method over prior methods, especially in generalization to out-of-distribution behavior policies. The code is available at https://github.com/PKU-AI-Edge/CORRO.
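As an illustration of the kind of contrastive objective the abstract refers to, the following is a minimal sketch of a generic InfoNCE-style loss for a single query embedding. It is not taken from the CORRO repository and is not claimed to be the paper's exact objective: the function name `info_nce_loss`, the dot-product similarity, the `temperature` parameter, and the toy embeddings are all assumptions made for this example.

```python
import math


def info_nce_loss(query, positive, negatives, temperature=1.0):
    """Generic InfoNCE loss for one query embedding.

    Hypothetical helper for illustration only. `query`, `positive`, and each
    entry of `negatives` are equal-length lists of floats (embeddings).
    """
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    # Similarity scores: the positive pair first, then every negative pair.
    logits = [dot(query, positive) / temperature]
    logits += [dot(query, neg) / temperature for neg in negatives]

    # Numerically stable log-sum-exp over all candidates.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))

    # Negative log-probability of picking the positive among all candidates.
    return log_denominator - logits[0]


# Toy embeddings (assumed values): a query, a sample from the same task,
# and two samples standing in for other tasks.
z = [1.0, 0.0]
same_task = [0.9, 0.1]
other_tasks = [[0.0, 1.0], [-1.0, 0.0]]

loss_matched = info_nce_loss(z, same_task, other_tasks)
loss_mismatched = info_nce_loss(z, other_tasks[0], [same_task, other_tasks[1]])
```

In this sketch the loss is lower when the positive is the embedding most similar to the query, which is the behavior a contrastive objective rewards. How the negative pairs are constructed is exactly the part the abstract says must be approximated, and it is left out here.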