Logically-Correct Reinforcement Learning

Abstract

We propose a novel Reinforcement Learning (RL) algorithm to synthesizepolicies for a Markov Decision Process (MDP), such that a linear time propertyis satisfied. We convert the property into a Limit Deterministic BuchiAutomaton (LDBA), then construct a product MDP between the automaton and theoriginal MDP. A reward function is then assigned to the states of the productautomaton, according to accepting conditions of the LDBA. With this rewardfunction, RL synthesizes a policy that satisfies the property: as such, thepolicy synthesis procedure is "constrained" by the given specification.Additionally, we show that the RL procedure sets up an online value iterationmethod to calculate the maximum probability of satisfying the given property,at any given state of the MDP - a convergence proof for the procedure isprovided. Finally, the performance of the algorithm is evaluated via a set ofnumerical examples. We observe an improvement of one order of magnitude in thenumber of iterations required for the synthesis compared to existingapproaches.

Quick Read (beta)

loading the full paper ...