Offline-to-Online Multi-Agent Reinforcement Learning with Offline Value Function Memory and Sequential Exploration

Abstract

Offline-to-Online Reinforcement Learning has emerged as a powerful paradigm, leveraging offline data for initialization and online fine-tuning to enhance both sample efficiency and performance. However, most existing research has focused on single-agent settings, with limited exploration of the multi-agent extension, i.e., Offline-to-Online Multi-Agent Reinforcement Learning (O2O MARL). In O2O MARL, two critical challenges become more prominent as the number of agents increases: (i) the risk of unlearning pre-trained Q-values due to distributional shifts during the transition from offline-to-online phases, and (ii) the difficulty of efficient exploration in the large joint state-action space. To tackle these challenges, we propose a novel O2O MARL framework called Offline Value Function Memory with Sequential Exploration (OVMSE). First, we introduce the Offline Value Function Memory (OVM) mechanism to compute target Q-values, preserving knowledge gained during offline training, ensuring smoother transitions, and enabling efficient fine-tuning. Second, we propose a decentralized Sequential Exploration (SE) strategy tailored for O2O MARL, which effectively utilizes the pre-trained offline policy for exploration, thereby significantly reducing the joint state-action space to be explored. Extensive experiments on the StarCraft Multi-Agent Challenge (SMAC) demonstrate that OVMSE significantly outperforms existing baselines, achieving superior sample efficiency and overall performance.
