Multimodal Information Bottleneck for Deep Reinforcement Learning with Multiple Sensors

Abstract

Reinforcement learning has achieved promising results on robotic control tasks but struggles to leverage information effectively from multiple sensory modalities that differ in many characteristics. Recent works construct auxiliary losses based on reconstruction or mutual information to extract joint representations from multiple sensory inputs to improve the sample efficiency and performance of reinforcement learning algorithms. However, the representations learned by these methods could capture information irrelevant to learning a policy and may degrade the performance. We argue that compressing information in the learned joint representations about raw multimodal observations is helpful, and propose a multimodal information bottleneck model to learn task-relevant joint representations from egocentric images and proprioception. Our model compresses and retains the predictive information in multimodal observations for learning a compressed joint representation, which fuses complementary information from visual and proprioceptive feedback and meanwhile filters out task-irrelevant information in raw multimodal observations. We propose to minimize the upper bound of our multimodal information bottleneck objective for computationally tractable optimization. Experimental evaluations on several challenging locomotion tasks with egocentric images and proprioception show that our method achieves better sample efficiency and zero-shot robustness to unseen white noise than leading baselines. We also empirically demonstrate that leveraging information from egocentric images and proprioception is more helpful for learning policies on locomotion tasks than solely using one single modality.
