Abstract
We address the challenge of offline reinforcement learning using realistic data, specifically non-expert data collected through sub-optimal behavior policies. Under such circumstances, the learned policy must be safe enough to manage distribution shift while maintaining sufficient flexibility to deal with non-expert (bad) demonstrations in the offline data. To tackle this issue, we introduce a novel method called Outcome-Driven Action Flexibility (ODAF), which seeks to reduce reliance on the empirical action distribution of the behavior policy, thereby reducing the negative impact of those bad demonstrations. Specifically, we develop a new conservative reward mechanism that deals with distribution shift by evaluating actions according to whether their outcomes meet safety requirements, that is, whether they remain within the state support area, rather than depending solely on the actions' likelihood under the offline data. Besides theoretical justification, we provide empirical evidence on widely used MuJoCo and various maze benchmarks, demonstrating that our ODAF method, implemented using uncertainty quantification techniques, effectively tolerates unseen transitions for improved "trajectory stitching," while enhancing the agent's ability to learn from realistic non-expert data.