Abstract
Safe reinforcement learning (RL) is a popular and versatile paradigm for learning reward-maximizing policies with safety guarantees. Previous works tend to express the safety constraints in expectation form because it is easy to implement, but this turns out to be ineffective at maintaining the safety constraints with high probability. To this end, we move to quantile-constrained RL, which enables a higher level of safety without any expectation-form approximations. We directly estimate the quantile gradients through sampling and provide theoretical proofs of convergence. A tilted update strategy for the quantile gradients is then implemented to compensate for the asymmetric distributional density, with a direct benefit to return performance. Experiments demonstrate that the proposed model fully meets the safety requirements (quantile constraints) while outperforming state-of-the-art benchmarks with higher return.
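To illustrate why an expectation-form constraint can fail to maintain safety with high probability, the following minimal sketch compares the two constraint forms on a toy set of episode costs. It is not the method proposed here: the cost values, the threshold of 10, the level 0.95, and the helper names `empirical_quantile` and `check_constraints` are all hypothetical choices made for illustration.

```python
import math

def empirical_quantile(samples, alpha):
    """Return the empirical alpha-quantile (order-statistic definition)."""
    ordered = sorted(samples)
    # Index of the smallest sample whose empirical CDF reaches alpha.
    k = max(0, math.ceil(alpha * len(ordered)) - 1)
    return ordered[k]

def check_constraints(costs, threshold, alpha):
    """Check an expectation constraint and a quantile constraint on costs."""
    mean_cost = sum(costs) / len(costs)
    quantile_cost = empirical_quantile(costs, alpha)
    return {
        "mean_cost": mean_cost,
        "quantile_cost": quantile_cost,
        "expectation_ok": mean_cost <= threshold,
        "quantile_ok": quantile_cost <= threshold,
    }

# Hypothetical toy data: 90 episodes with zero cost, 10 with a large cost.
costs = [0] * 90 + [50] * 10

# Assumed threshold and quantile level, chosen only for this illustration.
result = check_constraints(costs, threshold=10, alpha=0.95)
print(result)
```

In this toy case the mean cost is 5, so the expectation constraint holds, while the 0.95-quantile of the cost is 50, so the quantile constraint is violated: one episode in ten still exceeds the threshold by a wide margin.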