Sparsity-based Safety Conservatism for Constrained Offline Reinforcement Learning

Abstract

Reinforcement Learning (RL) has achieved notable success in decision-making fields such as autonomous driving and robotic manipulation. Yet its reliance on real-time feedback poses challenges in costly or hazardous settings. Furthermore, RL's training approach, centered on "on-policy" sampling, does not fully capitalize on data. Hence, offline RL has emerged as a compelling alternative, particularly when conducting additional experiments is impractical and abundant datasets are available. However, distributional shift (extrapolation), the disparity between the data distribution and the learned policy, also poses a risk in offline RL, potentially leading to significant safety breaches due to estimation errors (interpolation). This concern is particularly pronounced in safety-critical domains, where real-world problems are prevalent. To address both extrapolation and interpolation errors, numerous studies have introduced additional constraints that confine policy behavior, steering it towards more cautious decision-making. While many studies have addressed extrapolation errors, fewer have provided effective solutions for interpolation errors. For example, some works tackle this issue by incorporating a potential cost-maximizing optimization that perturbs the original dataset. However, this approach involves a bi-level optimization structure, which may introduce significant instability or complicate problem-solving in high-dimensional tasks. This motivates us to pinpoint, based on the sparsity of the available data, areas where hazards may be more prevalent than initially estimated, which provides significant insight into constrained offline RL. In this paper, we present conservative metrics based on data sparsity that demonstrate high generalizability to any method and efficacy compared to bi-level cost-ub-maximization.
