FAWAC: Feasibility Informed Advantage Weighted Regression for Persistent Safety in Offline Reinforcement Learning

Abstract

Safe offline reinforcement learning aims to learn policies that maximize cumulative rewards while adhering to safety constraints, using only offline data for training. A key challenge is balancing safety and performance, particularly when the policy encounters out-of-distribution (OOD) states and actions, which can lead to safety violations or overly conservative behavior during deployment. To address these challenges, we introduce Feasibility Informed Advantage Weighted Actor-Critic (FAWAC), a method that prioritizes persistent safety in constrained Markov decision processes (CMDPs). FAWAC formulates policy optimization with feasibility conditions derived specifically for offline datasets, enabling safe policy updates in non-parametric policy space, followed by projection into parametric space for constrained actor training. By incorporating a cost-advantage term into Advantage Weighted Regression (AWR), FAWAC ensures that the safety constraints are respected while maximizing performance. Additionally, we propose a strategy to address a more challenging class of problems that involves tempting datasets where trajectories are predominantly high-rewarded but unsafe. Empirical evaluations on standard benchmarks demonstrate that FAWAC achieves strong results, effectively balancing safety and performance in learning policies from the static datasets.
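
To make the idea of adding a cost-advantage term to AWR concrete, the following is a minimal sketch, not the objective used in the paper. It assumes the regression weight takes the form exp((A_r - lam * A_c) / temperature), where A_r is the reward advantage and A_c is the cost advantage. The function names, the multiplier `lam`, the `temperature`, and the clipping value `max_weight` are all hypothetical choices made for illustration.

```python
import math


def cost_aware_awr_weights(reward_adv, cost_adv, lam=1.0, temperature=1.0, max_weight=20.0):
    """Hypothetical cost-aware AWR weights (an assumed form, not the paper's).

    Each weight is exp((A_r - lam * A_c) / temperature), clipped at max_weight
    so that a few large advantages cannot dominate the regression.
    """
    weights = []
    for a_r, a_c in zip(reward_adv, cost_adv):
        w = math.exp((a_r - lam * a_c) / temperature)
        weights.append(min(w, max_weight))
    return weights


def weighted_neg_log_likelihood(log_probs, weights):
    """Weighted regression loss: -mean(w_i * log pi(a_i | s_i))."""
    return -sum(w * lp for w, lp in zip(weights, log_probs)) / len(log_probs)


# Three dataset transitions: the second has the same reward advantage as the
# first but a large cost advantage, i.e. it is "tempting" but unsafe.
reward_adv = [1.0, 1.0, -0.5]
cost_adv = [0.0, 2.0, -1.0]

weights = cost_aware_awr_weights(reward_adv, cost_adv, lam=1.0, temperature=1.0)

# Log-probabilities of the dataset actions under the current policy (made up).
log_probs = [-0.2, -0.1, -0.7]
loss = weighted_neg_log_likelihood(log_probs, weights)
```

Under these assumptions, the unsafe high-reward transition receives a smaller weight than the safe one, so the actor imitates it less. With `lam=0` the sketch reduces to plain AWR weighting on the reward advantage alone.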
