CertRL: Formalizing Convergence Proofs for Value and Policy Iteration in Coq

  • 2020-09-23 22:28:17
  • Koundinya Vajjha, Avraham Shinnar, Vasily Pestun, Barry Trager, Nathan Fulton
  • 7

Abstract

Reinforcement learning algorithms solve sequential decision-making problemsin probabilistic environments by optimizing for long-term reward. The desire touse reinforcement learning in safety-critical settings inspires a recent lineof work on formally constrained reinforcement learning; however, these methodsplace the implementation of the learning algorithm in their Trusted ComputingBase. The crucial correctness property of these implementations is a guaranteethat the learning algorithm converges to an optimal policy. This paper beginsthe work of closing this gap by developing a Coq formalization of two canonicalreinforcement learning algorithms: value and policy iteration for finite stateMarkov decision processes. The central results are a formalization of Bellman'soptimality principle and its proof, which uses a contraction property ofBellman optimality operator to establish that a sequence converges in theinfinite horizon limit. The CertRL development exemplifies how the Giry monadand mechanized metric coinduction streamline optimality proofs forreinforcement learning algorithms. The CertRL library provides a generalframework for proving properties about Markov decision processes andreinforcement learning algorithms, paving the way for further work onformalization of reinforcement learning algorithms.

 

Quick Read (beta)

loading the full paper ...