Abstract
Inverse reinforcement learning is the problem of inferring a reward functionfrom an optimal policy or demonstrations by an expert. In this work, it isassumed that the reward is expressed as a reward machine whose transitionsdepend on atomic propositions associated with the state of a Markov DecisionProcess (MDP). Our goal is to identify the true reward machine using finiteinformation. To this end, we first introduce the notion of a prefix tree policywhich associates a distribution of actions to each state of the MDP and eachattainable finite sequence of atomic propositions. Then, we characterize anequivalence class of reward machines that can be identified given the prefixtree policy. Finally, we propose a SAT-based algorithm that uses informationextracted from the prefix tree policy to solve for a reward machine. It isproved that if the prefix tree policy is known up to a sufficient (but finite)depth, our algorithm recovers the exact reward machine up to the equivalenceclass. This sufficient depth is derived as a function of the number of MDPstates and (an upper bound on) the number of states of the reward machine.These results are further extended to the case where we only have access todemonstrations from an optimal policy. Several examples, including discretegrid and block worlds, a continuous state-space robotic arm, and real data fromexperiments with mice, are used to demonstrate the effectiveness and generalityof the approach.