Abstract
Reinforcement learning (RL) in the real world necessitates the development of procedures that enable agents to explore without causing harm to themselves or others. The most successful solutions to the problem of safe RL leverage offline data to learn a safe-set, enabling safe online exploration. However, this approach to safe learning is often constrained by the demonstrations that are available for learning. In this paper, we investigate how the quantity and quality of the data used to train the initial safe learning problem offline influence the ability to learn safe-RL policies online. Specifically, we focus on tasks with spatially extended goal states where few or no demonstrations are available. Classically, this problem is addressed either by using hand-designed controllers to generate data or by collecting user-generated demonstrations. However, these methods are often expensive and do not scale to more complex tasks and environments. To address this limitation, we propose an unsupervised RL-based offline data collection procedure that learns complex and scalable policies without the need for hand-designed controllers or user demonstrations. Our research demonstrates the significance of providing sufficient demonstrations for agents to learn optimal safe-RL policies online, and as a result, we propose optimistic forgetting, a novel online safe-RL approach that is practical for scenarios with limited data. Further, our unsupervised data collection approach highlights the need to balance diversity and optimality for safe online exploration.