SAMG: State-Action-Aware Offline-to-Online Reinforcement Learning with Offline Model Guidance

Abstract

The offline-to-online (O2O) paradigm in reinforcement learning (RL) utilizes pre-trained models on offline datasets for subsequent online fine-tuning. However, conventional O2O RL algorithms typically require maintaining and retraining on the large offline datasets to mitigate the effects of out-of-distribution (OOD) data, which limits their efficiency in exploiting online samples. To address this challenge, we introduce a new paradigm called SAMG: State-Action-Conditional Offline-to-Online Reinforcement Learning with Offline Model Guidance. In particular, rather than directly training on offline data, SAMG freezes the pre-trained offline critic to provide offline values for each state-action pair to deliver compact offline information. This framework eliminates the need for retraining with offline data by freezing and leveraging these values of the offline model. These are then incorporated with the online target critic using a Bellman equation weighted by a policy state-action-aware coefficient. This coefficient, derived from a conditional variational auto-encoder (C-VAE), aims to capture the reliability of the offline data on a state-action level. SAMG can be easily integrated with existing Q-function-based O2O RL algorithms. Theoretical analysis shows good optimality and lower estimation error of SAMG. Empirical evaluations demonstrate that SAMG outperforms four state-of-the-art O2O RL algorithms in the D4RL benchmark.
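
As a rough illustration of the weighting idea described above, the sketch below blends a standard one-step Bellman target from the online target critic with a value from a frozen offline critic, using a per-state-action coefficient in [0, 1]. The abstract does not give the exact update rule, so the convex-combination form, the function name `blended_target`, and all parameter names are assumptions made for illustration, not the paper's equation. In SAMG the coefficient would come from a C-VAE; here it is simply passed in as a number.

```python
def blended_target(reward, done, q_target_next, q_offline, coef, gamma=0.99):
    """Blend an online TD target with a frozen offline critic's value.

    Assumed form (illustrative only, not taken from the paper):
        y = (1 - p) * [r + gamma * (1 - done) * Q_target(s', a')] + p * Q_off(s, a)
    where p = coef is a state-action-aware weight in [0, 1].
    """
    # Clamp the coefficient so the result stays a convex combination.
    p = min(max(coef, 0.0), 1.0)
    # Standard one-step Bellman target from the online target critic.
    not_done = 0.0 if done else 1.0
    td_target = reward + gamma * not_done * q_target_next
    # A larger p places more trust in the frozen offline value for this (s, a).
    return (1.0 - p) * td_target + p * q_offline


# Hypothetical numbers: with p = 0 the target is purely online,
# with p = 1 it is purely the offline critic's value.
online_only = blended_target(1.0, False, 2.0, 3.0, coef=0.0, gamma=0.5)
offline_only = blended_target(1.0, False, 2.0, 3.0, coef=1.0, gamma=0.5)
half_and_half = blended_target(1.0, False, 2.0, 3.0, coef=0.5, gamma=0.5)
```

Under this assumed form, no offline transitions are needed during fine-tuning: the only offline artifact consulted is the frozen critic's output for the sampled state-action pair.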
