Abstract
The fundamental assumption of reinforcement learning in Markov decision processes (MDPs) is that the relevant decision process is, in fact, Markov. However, when MDPs have rich observations, agents typically learn by way of an abstract state representation, and such representations are not guaranteed to preserve the Markov property. We introduce a novel set of conditions and prove that they are sufficient for learning a Markov abstract state representation. We then describe a practical training procedure that combines inverse model estimation and temporal contrastive learning to learn an abstraction that approximately satisfies these conditions. Our novel training objective is compatible with both online and offline training: it does not require a reward signal, but agents can capitalize on reward information when available. We empirically evaluate our approach on a visual gridworld domain and a set of continuous control benchmarks. Our approach learns representations that capture the underlying structure of the domain and lead to improved sample efficiency over state-of-the-art deep reinforcement learning with visual features, often matching or exceeding the performance achieved with hand-designed compact state information.