Reinforcement Learning for Omega-Regular Specifications on Continuous-Time MDP

Abstract

Continuous-time Markov decision processes (CTMDPs) are canonical models for expressing sequential decision-making in dense-time, stochastic environments. When the stochastic evolution of the environment is only available via sampling, model-free reinforcement learning (RL) is the algorithm of choice for computing optimal decision sequences. RL, however, requires the learning objective to be encoded as scalar reward signals. Since doing such translations manually is both tedious and error-prone, a number of techniques have been proposed to translate high-level objectives (expressed in logic or automata formalisms) to scalar rewards for discrete-time Markov decision processes (MDPs). Unfortunately, no automatic translation exists for CTMDPs. We consider CTMDP environments against learning objectives expressed as omega-regular languages. Omega-regular languages generalize regular languages to infinite-horizon specifications and can express properties given in the popular linear-time logic LTL. To accommodate the dense-time nature of CTMDPs, we consider two different semantics of omega-regular objectives: 1) satisfaction semantics, where the goal of the learner is to maximize the probability of spending positive time in the good states, and 2) expectation semantics, where the goal of the learner is to optimize the long-run expected average time spent in the "good states" of the automaton. We present an approach enabling correct translation to scalar reward signals that can be readily used by off-the-shelf RL algorithms for CTMDPs. We demonstrate the effectiveness of the proposed algorithms by evaluating them on some popular CTMDP benchmarks with omega-regular objectives.
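
To make the quantity behind the expectation semantics concrete, the following is a minimal Monte Carlo sketch, not the translation proposed in the paper. It assumes a policy has already been fixed, so the CTMDP reduces to a continuous-time Markov chain with exponentially distributed sojourn times, and it estimates the fraction of time a sampled path spends in the good states. The function name `simulate_good_time_fraction` and its parameters (`rates`, `transitions`, `good`) are hypothetical and chosen only for illustration.

```python
import random

def simulate_good_time_fraction(rates, transitions, good, start, horizon, rng):
    """Fraction of [0, horizon] spent in `good` states along one sampled path.

    Illustrative assumptions (not from the paper):
      rates[s]        exit rate of state s; sojourn time ~ Exponential(rates[s])
      transitions[s]  list of (next_state, probability) pairs for state s
      good            set of states counted as "good"
    """
    t, state, good_time = 0.0, start, 0.0
    while t < horizon:
        # Sample the dwell time in the current state, truncated at the horizon.
        dwell = min(rng.expovariate(rates[state]), horizon - t)
        if state in good:
            good_time += dwell
        t += dwell
        # Sample the successor state from the jump distribution.
        successors, probs = zip(*transitions[state])
        state = rng.choices(successors, weights=probs)[0]
    return good_time / horizon

# Toy two-state chain: state 0 (rate 1.0) and good state 1 (rate 2.0) alternate.
toy_rates = {0: 1.0, 1: 2.0}
toy_transitions = {0: [(1, 1.0)], 1: [(0, 1.0)]}
estimate = simulate_good_time_fraction(
    toy_rates, toy_transitions, {1}, 0, 20000.0, random.Random(0)
)
```

In this toy chain the mean sojourn times are 1 and 1/2, so the long-run fraction of time in the good state is (1/2) / (1 + 1/2) = 1/3, and the estimate should approach that value as the horizon grows.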
