Abstract
Safe offline reinforcement learning aims to learn policies that maximize cumulative rewards while adhering to safety constraints, using only offline data for training. A key challenge is balancing safety and performance, particularly when the policy encounters out-of-distribution (OOD) states and actions, which can lead to safety violations or overly conservative behavior during deployment. To address these challenges, we introduce Feasibility Informed Advantage Weighted Actor-Critic (FAWAC), a method that prioritizes persistent safety in constrained Markov decision processes (CMDPs). FAWAC formulates policy optimization with feasibility conditions derived specifically for offline datasets, enabling safe policy updates in non-parametric policy space, followed by projection into parametric space for constrained actor training. By incorporating a cost-advantage term into Advantage Weighted Regression (AWR), FAWAC ensures that the safety constraints are respected while maximizing performance. Additionally, we propose a strategy to address a more challenging class of problems involving tempting datasets, in which trajectories are predominantly high-reward but unsafe. Empirical evaluations on standard benchmarks demonstrate that FAWAC achieves strong results, effectively balancing safety and performance when learning policies from static datasets.
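As a rough illustration of what adding a cost-advantage term to AWR could look like, the sketch below computes exponential weights in which a higher reward advantage raises the weight of a dataset action and a higher cost advantage lowers it. The specific combination `adv_reward - lam * adv_cost`, the multiplier `lam`, the `temperature`, the clipping value `max_weight`, and both function names are assumptions made for illustration only; this is not the exact FAWAC objective.

```python
import numpy as np


def cost_aware_awr_weights(adv_reward, adv_cost, lam=1.0, temperature=1.0,
                           max_weight=100.0):
    """Exponential advantage weights with a cost-advantage penalty.

    Hypothetical form: w = min(exp((A_r - lam * A_c) / temperature), max_weight).
    All parameter names and defaults are illustrative assumptions.
    """
    adv_reward = np.asarray(adv_reward, dtype=float)
    adv_cost = np.asarray(adv_cost, dtype=float)
    # Reward advantage increases the weight; cost advantage decreases it.
    combined = adv_reward - lam * adv_cost
    weights = np.exp(combined / temperature)
    # Clip to keep a few very large weights from dominating the update.
    return np.minimum(weights, max_weight)


def weighted_regression_loss(log_probs, weights):
    """Weighted negative log-likelihood of dataset actions under the actor."""
    log_probs = np.asarray(log_probs, dtype=float)
    return -np.mean(weights * log_probs)


# Two dataset actions with equal reward advantage; the second is costlier.
w = cost_aware_awr_weights(adv_reward=[1.0, 1.0], adv_cost=[0.0, 2.0])
loss = weighted_regression_loss(log_probs=[-1.0, -2.0], weights=w)
```

In this sketch the costlier action gets the smaller weight, so a weighted regression step imitates it less strongly; a larger `lam` pushes the actor further toward low-cost actions at the expense of reward.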