Abstract
One of the fundamental challenges associated with reinforcement learning (RL) is that collecting sufficient data can be both time-consuming and expensive. In this paper, we formalize a concept of time reversal symmetry in a Markov decision process (MDP), which builds upon the established structure of dynamically reversible Markov chains (DRMCs) and time-reversibility in classical physics. Specifically, we investigate the utility of this concept in reducing the sample complexity of reinforcement learning. We observe that utilizing the structure of time reversal in an MDP allows every environment transition experienced by an agent to be transformed into a feasible reverse-time transition, effectively doubling the number of experiences in the environment. To test the usefulness of this newly synthesized data, we develop a novel approach called time symmetric data augmentation (TSDA) and investigate its application to both proprioceptive and pixel-based state representations in off-policy, model-free RL. Empirical evaluations show that these synthetic transitions can enhance the sample efficiency of RL agents in time-reversible scenarios without friction or contact. We also test this method in more realistic environments where these assumptions are not globally satisfied. In these environments, we find that TSDA can significantly degrade sample efficiency and policy performance, but can also improve sample efficiency under the right conditions. Ultimately, we conclude that time symmetry shows promise in enhancing the sample efficiency of reinforcement learning, and we provide guidance on when the environment and reward structures are of an appropriate form for TSDA to be employed effectively.
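As a rough illustration of the reverse-time transformation described above, the following minimal sketch builds a synthetic transition from an observed one. It is not the paper's implementation. It assumes a proprioceptive state laid out as positions followed by velocities, a time-reversal map that negates the velocities, an action that is left unchanged under reversal (plausible for force-like actions in frictionless dynamics), and a caller-supplied reward function. The names `reverse_state`, `time_reversed_transition`, `augment_batch`, `n_pos`, and `reward_fn` are hypothetical.

```python
import numpy as np


def reverse_state(state, n_pos):
    """Assumed time-reversal map: keep the first n_pos entries (positions)
    and negate the remaining entries (velocities)."""
    flipped = np.asarray(state, dtype=float).copy()
    flipped[n_pos:] *= -1.0
    return flipped


def time_reversed_transition(s, a, s_next, n_pos, reward_fn):
    """Turn an observed transition (s, a, s_next) into a synthetic one.

    The synthetic transition starts at the reversed next state and ends at
    the reversed original state. The action is reused unchanged, which is
    an assumption of this sketch. The reward is recomputed with the
    hypothetical reward_fn(state, action, next_state).
    """
    s_rev = reverse_state(s_next, n_pos)
    s_rev_next = reverse_state(s, n_pos)
    a_rev = np.asarray(a, dtype=float).copy()
    r_rev = reward_fn(s_rev, a_rev, s_rev_next)
    return s_rev, a_rev, r_rev, s_rev_next


def augment_batch(batch, n_pos, reward_fn):
    """Return the original (s, a, r, s_next) tuples followed by one
    synthetic reverse-time tuple per original, doubling the batch."""
    augmented = list(batch)
    for s, a, _r, s_next in batch:
        augmented.append(time_reversed_transition(s, a, s_next, n_pos, reward_fn))
    return augmented
```

A batch of N transitions passed through `augment_batch` comes back with 2N entries, which could then be stored in an off-policy replay buffer. Whether the synthetic half is actually feasible depends on the environment: under the stated assumptions it holds for a frictionless point mass driven by a constant force over the step, but it need not hold once friction or contact is present.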