Efficient Policy Evaluation with Safety Constraint for Reinforcement Learning

Abstract

In reinforcement learning, classic on-policy evaluation methods often suffer from high variance and require massive online data to attain the desired accuracy. Previous studies attempt to reduce evaluation variance by searching for or designing proper behavior policies to collect data. However, these approaches ignore the safety of such behavior policies -- the designed behavior policies have no safety guarantee and may lead to severe damage during online executions. In this paper, to address the challenge of reducing variance while ensuring safety simultaneously, we propose an optimal variance-minimizing behavior policy under safety constraints. Theoretically, while ensuring safety constraints, our evaluation method is unbiased and has lower variance than on-policy evaluation. Empirically, our method is the only existing method to achieve both substantial variance reduction and safety constraint satisfaction. Furthermore, we show our method is even superior to previous methods in both variance reduction and execution safety.
