Logically-Constrained Reinforcement Learning

Abstract

This paper proposes a model-free Reinforcement Learning (RL) algorithm tosynthesise policies for an unknown Markov Decision Process (MDP), such that alinear time property is satisfied. We convert the given property into a LimitDeterministic Buchi Automaton (LDBA), then construct a synchronized MDP betweenthe automaton and the original MDP. According to the resulting LDBA, a rewardfunction is then defined over the state-action pairs of the product MDP. Withthis reward function, our algorithm synthesises a policy whose traces satisfiesthe linear time property: as such, the policy synthesis procedure is"constrained" by the given specification. Additionally, we show that the RLprocedure sets up an online value iteration method to calculate the maximumprobability of satisfying the given property, at any given state of the MDP - aconvergence proof for the procedure is provided. Finally, the performance ofthe algorithm is evaluated via a set of numerical examples. We observe animprovement of one order of magnitude in the number of iterations required forthe synthesis compared to existing approaches.

Quick Read (beta)

loading the full paper ...