Abstract
Reinforcement learning algorithms solve sequential decision-making problemsin probabilistic environments by optimizing for long-term reward. The desire touse reinforcement learning in safety-critical settings inspires a recent lineof work on formally constrained reinforcement learning; however, these methodsplace the implementation of the learning algorithm in their Trusted ComputingBase. The crucial correctness property of these implementations is a guaranteethat the learning algorithm converges to an optimal policy. This paper beginsthe work of closing this gap by developing a Coq formalization of two canonicalreinforcement learning algorithms: value and policy iteration for finite stateMarkov decision processes. The central results are a formalization of Bellman'soptimality principle and its proof, which uses a contraction property ofBellman optimality operator to establish that a sequence converges in theinfinite horizon limit. The CertRL development exemplifies how the Giry monadand mechanized metric coinduction streamline optimality proofs forreinforcement learning algorithms. The CertRL library provides a generalframework for proving properties about Markov decision processes andreinforcement learning algorithms, paving the way for further work onformalization of reinforcement learning algorithms.