Abstract
This paper proposes the first model-free Reinforcement Learning (RL)framework to synthesise policies for an unknown, and possibly continuous-state,Markov Decision Process (MDP), such that a given linear temporal property issatisfied. We convert the given property into a Limit Deterministic BuchiAutomaton (LDBA), namely a finite-state machine expressing the property.Exploiting the structure of the LDBA, we shape an adaptive reward functionon-the-fly, so that an RL algorithm can synthesise a policy resulting in tracesthat probabilistically satisfy the linear temporal property. This probability(certificate) is also calculated in parallel with learning, i.e. the RLalgorithm produces a policy that is certifiably safe with respect to theproperty. Under the assumption that the MDP has a finite number of states,theoretical guarantees are provided on the convergence of the RL algorithm. Wealso show that our method produces "best available" control policies when thelogical property cannot be satisfied. Whenever the MDP has a continuous statespace, we empirically show that our framework finds satisfying policies, ifthere exist such policies. Additionally, the proposed algorithm can handletime-varying periodic environments. The performance of the proposedarchitecture is evaluated via a set of numerical examples and benchmarks, wherewe observe an improvement of one order of magnitude in the number of iterationsrequired for the policy synthesis, compared to existing approaches wheneveravailable.