Reinforcement Learning of Control Policy for Linear Temporal Logic Specifications Using Limit-Deterministic Büchi Automata

Abstract

This letter proposes a novel reinforcement learning method for the synthesis of a control policy satisfying a control specification described by a linear temporal logic formula. We assume that the controlled system is modeled by a Markov decision process (MDP). We transform the specification to a limit-deterministic Büchi automaton (LDBA) with several accepting sets that accepts all infinite sequences satisfying the formula. The LDBA is augmented so that it explicitly records the previous visits to accepting sets. We take a product of the augmented LDBA and the MDP, based on which we define a reward function. The agent gets rewards whenever state transitions are in an accepting set that has not been visited for a certain number of steps. Consequently, sparsity of rewards is relaxed and optimal circulations among the accepting sets are learned. We show that the proposed method can learn an optimal policy when the discount factor is sufficiently close to one.
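
The reward mechanism summarized above can be illustrated with a minimal sketch. It is only one possible reading of the abstract, not the letter's exact construction: the function name `step_memory`, the boolean memory vector, the unit reward, and the rule of clearing the memory once every accepting set has been recorded are all assumptions made for illustration.

```python
from typing import FrozenSet, Tuple


def step_memory(
    visited: Tuple[bool, ...], hit: FrozenSet[int]
) -> Tuple[Tuple[bool, ...], float]:
    """Update a hypothetical visit memory and return (new_memory, reward).

    visited: one flag per accepting set, True if that set is already recorded.
    hit: indices of the accepting sets that contain the current transition.
    """
    # Reward only when the transition lies in an accepting set not yet recorded.
    newly_visited = [i for i in hit if not visited[i]]
    reward = 1.0 if newly_visited else 0.0  # unit reward is an assumption

    # Record the accepting sets touched by this transition.
    updated = tuple(v or (i in hit) for i, v in enumerate(visited))

    # Assumed reset rule: clear the memory once all sets have been recorded,
    # so the agent is encouraged to keep circulating among the accepting sets.
    if all(updated):
        updated = tuple(False for _ in updated)
    return updated, reward
```

With two accepting sets, a transition in set 0 is rewarded once, a repeated transition in set 0 is not, and a transition in set 1 is rewarded and clears the memory. Under this reading, rewards arrive more often than if they were given only after all accepting sets had been visited.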
