Unsupervised-to-Online Reinforcement Learning

Abstract

Offline-to-online reinforcement learning (RL), a framework that trains a policy with offline RL and then further fine-tunes it with online RL, has been considered a promising recipe for data-driven decision-making. While sensible, this framework has drawbacks: it requires domain-specific offline RL pre-training for each task, and is often brittle in practice. In this work, we propose unsupervised-to-online RL (U2O RL), which replaces domain-specific supervised offline RL with unsupervised offline RL, as a better alternative to offline-to-online RL. U2O RL not only enables reusing a single pre-trained model for multiple downstream tasks, but also learns better representations, which often result in even better performance and stability than supervised offline-to-online RL. To instantiate U2O RL in practice, we propose a general recipe for U2O RL to bridge task-agnostic unsupervised offline skill-based policy pre-training and supervised online fine-tuning. Throughout our experiments in nine state-based and pixel-based environments, we empirically demonstrate that U2O RL achieves strong performance that matches or even outperforms previous offline-to-online RL approaches, while being able to reuse a single pre-trained model for a number of different downstream tasks.
