Abstract
Prior access to domain knowledge could significantly improve the performance of a reinforcement learning agent. In particular, it could help agents avoid potentially catastrophic exploratory actions, which would otherwise have to be experienced during learning. In this work, we identify consistently undesirable actions in a set of previously learned tasks, and use pseudo-rewards associated with them to learn a prior policy. In addition to enabling safer exploratory behaviors in subsequent tasks in the domain, we show that these priors are transferable to similar environments, and can be learned off-policy and in parallel with the learning of other tasks in the domain. We compare our approach to established, state-of-the-art algorithms in both discrete and continuous environments, and demonstrate that it exhibits safer exploratory behavior while learning to perform arbitrary tasks in the domain. We also present a theoretical analysis to support these results, and briefly discuss the implications and some alternative formulations of this approach, which could also be useful in certain scenarios.