Abstract
Safe reinforcement learning aims to learn a control policy while ensuring that neither the system nor the environment gets damaged during the learning process. For implementing safe reinforcement learning on highly nonlinear and high-dimensional dynamical systems, one possible approach is to find a low-dimensional safe region via data-driven feature extraction methods, which provides safety estimates to the learning algorithm. As the reliability of the learned safety estimates is data-dependent, we investigate in this work how different training data will affect the safe reinforcement learning approach. By balancing between the learning performance and the risk of being unsafe, a data generation method that combines two sampling methods is proposed to generate representative training data. The performance of the method is demonstrated with a three-link inverted pendulum example.