M2CURL: Sample-Efficient Multimodal Reinforcement Learning via Self-Supervised Representation Learning for Robotic Manipulation

Abstract

One of the most critical aspects of multimodal Reinforcement Learning (RL) is the effective integration of different observation modalities. Having robust and accurate representations derived from these modalities is key to enhancing the robustness and sample efficiency of RL algorithms. However, learning representations in RL settings for visuotactile data poses significant challenges, particularly due to the high dimensionality of the data and the complexity involved in correlating visual and tactile inputs with the dynamic environment and task objectives. To address these challenges, we propose Multimodal Contrastive Unsupervised Reinforcement Learning (M2CURL). Our approach employs a novel multimodal self-supervised learning technique that learns efficient representations and contributes to faster convergence of RL algorithms. Our method is agnostic to the RL algorithm, thus enabling its integration with any available RL algorithm. We evaluate M2CURL on the Tactile Gym 2 simulator and we show that it significantly enhances the learning efficiency in different manipulation tasks. This is evidenced by faster convergence rates and higher cumulative rewards per episode, compared to standard RL algorithms without our representation learning approach.
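
To make the idea of contrastive learning across modalities concrete, the following is a minimal sketch of a generic symmetric InfoNCE loss between visual and tactile embeddings. It is an illustration under stated assumptions, not the M2CURL objective: the abstract does not specify the loss, the encoders, or the choice of positive pairs. The function names (`normalize`, `info_nce`, `symmetric_info_nce`), the temperature of 0.1, and the random toy embeddings are all hypothetical.

```python
import math
import random


def normalize(vec):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def info_nce(anchors, candidates, temperature=0.1):
    """Mean cross-entropy of matching anchors[i] to candidates[i].

    Assumption: row i of both lists comes from the same time step, so it
    forms the positive pair; every other row in the batch is a negative.
    """
    anchors = [normalize(a) for a in anchors]
    candidates = [normalize(c) for c in candidates]
    total = 0.0
    for i, a in enumerate(anchors):
        logits = [sum(x * y for x, y in zip(a, c)) / temperature
                  for c in candidates]
        # Numerically stable log-sum-exp over all candidates.
        top = max(logits)
        log_denominator = top + math.log(sum(math.exp(z - top) for z in logits))
        total += log_denominator - logits[i]
    return total / len(anchors)


def symmetric_info_nce(visual, tactile, temperature=0.1):
    """Average the visual-to-tactile and tactile-to-visual losses."""
    return 0.5 * (info_nce(visual, tactile, temperature)
                  + info_nce(tactile, visual, temperature))


# Toy embeddings (hypothetical): tactile is a noisy copy of visual.
random.seed(0)
visual = [[random.gauss(0, 1) for _ in range(16)] for _ in range(8)]
tactile = [[x + 0.01 * random.gauss(0, 1) for x in row] for row in visual]

aligned_loss = symmetric_info_nce(visual, tactile)
# Shifting the tactile rows by one breaks the pairing, so the loss rises.
mismatched_loss = symmetric_info_nce(visual, tactile[1:] + tactile[:1])
```

In a sketch like this, the loss is low when paired visual and tactile embeddings agree and high when the pairing is broken, which is the signal a contrastive encoder would be trained to exploit before or alongside the RL update.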
