Offline Reinforcement Learning with Pseudometric Learning

Abstract

Offline Reinforcement Learning methods seek to learn a policy from logged transitions of an environment, without any interaction. In the presence of function approximation, and under the assumption of limited coverage of the state-action space of the environment, it is necessary to constrain the policy to visit state-action pairs close to the support of logged transitions. In this work, we propose an iterative procedure to learn a pseudometric (closely related to bisimulation metrics) from logged transitions, and use it to define this notion of closeness. We show its convergence and extend it to the function approximation setting. We then use this pseudometric to define a new lookup-based bonus in an actor-critic algorithm: PLOFF. This bonus encourages the actor to stay close, in terms of the defined pseudometric, to the support of logged transitions. Finally, we evaluate the method on hand manipulation and locomotion tasks.
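
To make the idea of a lookup-based bonus concrete, the following is a minimal sketch and not the PLOFF implementation. It assumes the pseudometric can be evaluated as a Euclidean distance between embeddings of state-action pairs, and it uses a hypothetical `embed` function (plain concatenation) in place of the learned one. The names `embed`, `pseudo_distance`, and `lookup_bonus` are illustrative and do not come from the paper. The bonus is taken here to be the negative distance from a queried pair to its nearest logged pair, so it is largest (zero) on the support of the logged transitions.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]
Pair = Tuple[Vector, Vector]  # (state, action)


def embed(state: Vector, action: Vector) -> List[float]:
    """Hypothetical stand-in for a learned embedding: plain concatenation.

    Assumption: the pseudometric between two state-action pairs is the
    Euclidean distance between their embeddings.
    """
    return list(state) + list(action)


def pseudo_distance(x: Pair, y: Pair) -> float:
    """Distance between two state-action pairs under the assumed pseudometric."""
    return math.dist(embed(*x), embed(*y))


def lookup_bonus(query: Pair, logged: Sequence[Pair]) -> float:
    """Negative distance from the query to its nearest logged pair.

    The value is zero when the query lies on the support of the logged
    transitions and decreases as the query moves away from it.
    """
    return -min(pseudo_distance(query, z) for z in logged)


# Toy logged dataset of (state, action) pairs, made up for illustration.
logged_pairs = [((0.0, 0.0), (1.0,)), ((1.0, 0.0), (0.0,))]

on_support = lookup_bonus(((0.0, 0.0), (1.0,)), logged_pairs)
off_support = lookup_bonus(((3.0, 4.0), (1.0,)), logged_pairs)
```

In an actor-critic setting, a term of this kind would be added to the actor's objective so that actions far from the logged data are penalized. The linear scan over the dataset is used only for clarity; an approximate nearest-neighbor index would be the natural choice at scale.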
