Abstract
Offline safe reinforcement learning (RL) has emerged as a promising approach for learning safe behaviors without engaging in risky online interactions with the environment. Most existing methods in offline safe RL rely on cost constraints at each time step (derived from global cost constraints), which can result in either overly conservative policies or violation of safety constraints. In this paper, we propose to learn a policy that generates desirable trajectories and avoids undesirable trajectories. Specifically, we first partition the pre-collected dataset of state-action trajectories into desirable and undesirable subsets. Intuitively, the desirable set contains high-reward, safe trajectories, and the undesirable set contains unsafe trajectories and low-reward safe trajectories. Second, we learn a policy that generates desirable trajectories and avoids undesirable trajectories, where (un)desirability scores are provided by a classifier learnt from the dataset of desirable and undesirable trajectories. This approach bypasses the computational complexity and stability issues of the min-max objective employed in existing methods. Theoretically, we also show our approach's strong connections to existing learning paradigms involving human feedback. Finally, we extensively evaluate our method using the DSRL benchmark for offline safe RL. Empirically, our method outperforms competitive baselines, achieving higher rewards and better constraint satisfaction across a wide variety of benchmark tasks.
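The partitioning step can be illustrated with a minimal sketch. It is not the paper's implementation. The `Trajectory` container, the `partition` helper, and the `cost_limit` and `reward_cutoff` parameters are hypothetical names introduced here. The sketch assumes that a trajectory is safe when its cumulative cost does not exceed a limit, and that a single reward cutoff separates high-reward from low-reward trajectories.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Trajectory:
    """Hypothetical summary of one pre-collected trajectory."""
    total_reward: float  # cumulative reward over the trajectory
    total_cost: float    # cumulative safety cost over the trajectory


def partition(
    trajectories: List[Trajectory],
    cost_limit: float,      # assumed safety threshold on cumulative cost
    reward_cutoff: float,   # assumed cutoff separating high from low reward
) -> Tuple[List[Trajectory], List[Trajectory]]:
    """Split trajectories into (desirable, undesirable) subsets.

    Desirable: safe and high-reward.
    Undesirable: unsafe, or safe but low-reward.
    """
    desirable: List[Trajectory] = []
    undesirable: List[Trajectory] = []
    for traj in trajectories:
        is_safe = traj.total_cost <= cost_limit
        if is_safe and traj.total_reward >= reward_cutoff:
            desirable.append(traj)
        else:
            undesirable.append(traj)
    return desirable, undesirable
```

Under these assumptions, the two subsets could then serve as labelled examples for training the (un)desirability classifier mentioned above.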