Data-Efficient Safe Policy Improvement Using Parametric Structure

Abstract

Safe policy improvement (SPI) is an offline reinforcement learning problem in which a new policy that reliably outperforms the behavior policy with high confidence needs to be computed using only a dataset and the behavior policy. Markov decision processes (MDPs) are the standard formalism for modeling environments in SPI. In many applications, additional information in the form of parametric dependencies between distributions in the transition dynamics is available. We make SPI more data-efficient by leveraging these dependencies through three contributions: (1) a parametric SPI algorithm that exploits known correlations between distributions to more accurately estimate the transition dynamics using the same amount of data; (2) a preprocessing technique that prunes redundant actions from the environment through a game-based abstraction; and (3) a more advanced preprocessing technique, based on satisfiability modulo theory (SMT) solving, that can identify more actions to prune. Empirical results and an ablation study show that our techniques increase the data efficiency of SPI by multiple orders of magnitude while maintaining the same reliability guarantees.
