Abstract
In this work we explore the use of latent representations obtained from multiple input sensory modalities (such as images or sounds) in allowing an agent to learn and exploit policies over different subsets of input modalities. We propose a three-stage architecture that allows a reinforcement learning agent trained over a given sensory modality to execute its task on a different sensory modality, for example, learning a visual policy over image inputs and then executing that policy when only sound inputs are available. We show that the generalized policies achieve better out-of-the-box performance when compared to different baselines. Moreover, we show this holds in different OpenAI Gym and video game environments, even when using different multimodal generative models and reinforcement learning algorithms.