MOOSS: Mask-Enhanced Temporal Contrastive Learning for Smooth State Evolution in Visual Reinforcement Learning

  • 2024-09-02 19:57:53
  • Jiarui Sun, M. Ugur Akcal, Wei Zhang, Girish Chowdhary
In visual Reinforcement Learning (RL), learning from pixel-based observations poses significant challenges on sample efficiency, primarily due to the complexity of extracting informative state representations from high-dimensional data. Previous methods such as contrastive-based approaches have made strides in improving sample efficiency but fall short in modeling the nuanced evolution of states. To address this, we introduce MOOSS, a novel framework that leverages a temporal contrastive objective with the help of graph-based spatial-temporal masking to explicitly model state evolution in visual RL. Specifically, we propose a self-supervised dual-component strategy that integrates (1) a graph construction of pixel-based observations for spatial-temporal masking, coupled with (2) a multi-level contrastive learning mechanism that enriches state representations by emphasizing temporal continuity and change of states. MOOSS advances the understanding of state dynamics by disrupting and learning from spatial-temporal correlations, which facilitates policy learning. Our comprehensive evaluation on multiple continuous and discrete control benchmarks shows that MOOSS outperforms previous state-of-the-art visual RL methods in terms of sample efficiency, demonstrating the effectiveness of our method. Our code is released at
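
The two components named in the abstract, spatial-temporal masking and a temporal contrastive objective, can be illustrated with a rough sketch. This is not the authors' implementation: the abstract does not specify the graph construction or the multi-level loss, so the code below substitutes plain random patch masking for the graph-based masking and a single-level InfoNCE loss for the multi-level objective. The function names `random_patch_mask` and `temporal_info_nce`, and all parameters, are hypothetical.

```python
import numpy as np

def random_patch_mask(T, H, W, patch, ratio, rng):
    """Assumed stand-in for graph-based masking: drop random patches per frame.

    Returns a boolean array of shape (T, H, W); True means the pixel is kept.
    H and W are assumed to be divisible by `patch`.
    """
    gh, gw = H // patch, W // patch
    keep = rng.random((T, gh, gw)) >= ratio  # one keep/drop decision per patch
    # Expand the patch grid back to pixel resolution.
    return np.repeat(np.repeat(keep, patch, axis=1), patch, axis=2)

def temporal_info_nce(queries, keys, temperature=0.1):
    """Single-level InfoNCE over time steps (an assumption, not the paper's loss).

    queries[t] (e.g. embeddings of masked frames) should match keys[t]
    (embeddings of the unmasked frames); all other time steps act as negatives.
    Both inputs have shape (T, D).
    """
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    k = keys / np.linalg.norm(keys, axis=1, keepdims=True)
    logits = q @ k.T / temperature               # (T, T) similarity matrix
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))   # positives sit on the diagonal
```

In a training loop under these assumptions, the mask would be applied to a stack of frames before encoding, and the loss would compare the encodings of masked frames with encodings of the original frames at the same time steps. Aligned embeddings yield a loss near zero, while temporally misaligned ones yield a larger loss.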


