Abstract
In this paper, we consider a transfer Reinforcement Learning (RL) problem in continuous state and action spaces, under unobserved contextual information. For example, the context can represent the mental view of the world that an expert agent has formed through past interactions with this world. We assume that this context is not accessible to a learner agent, who can only observe the expert data. Our goal is to use the context-aware expert data to learn an optimal context-unaware policy for the learner using only a few new data samples. Such problems are typically solved using imitation learning, which assumes that both the expert and learner agents have access to the same information. However, if the learner does not know the expert context, using the expert data alone will result in a biased learner policy that requires many new data samples to improve. To address this challenge, we formulate the learning problem as a causal bound-constrained Multi-Armed Bandit (MAB) problem. The arms of this MAB correspond to a set of basis policy functions that can be initialized in an unsupervised way using the expert data and that represent the different expert behaviors affected by the unobserved context. The MAB constraints, in turn, correspond to causal bounds on the accumulated rewards of these basis policy functions, which we also compute from the expert data. The solution to this MAB allows the learner agent to select the best basis policy and improve it online. The use of causal bounds reduces the exploration variance and, therefore, improves the learning rate. We provide numerical experiments on an autonomous driving example showing that our proposed transfer RL method improves the learner's policy faster than existing imitation learning methods and exhibits much lower variance during training.
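To make the bandit formulation concrete, the following is a minimal sketch of how causal bounds can constrain arm selection. It is not the algorithm of this paper: it swaps in a standard UCB1 index and truncates it to a per-arm interval, and it assumes that each arm (basis policy) returns an accumulated reward normalized to [0, 1]. All names (`prune_arms`, `clipped_ucb`, `run_bandit`, `pull`) and all numbers (`true_means`, `bounds`) are hypothetical and chosen only for illustration.

```python
import math
import random


def prune_arms(bounds):
    """Keep arms whose causal upper bound is not below the best causal lower bound."""
    best_lower = max(lo for lo, _ in bounds)
    return [k for k, (_, hi) in enumerate(bounds) if hi >= best_lower]


def clipped_ucb(mean, count, t, lower, upper):
    """Standard UCB1 index, truncated to the causal interval [lower, upper]."""
    ucb = mean + math.sqrt(2.0 * math.log(t) / count)
    return min(max(ucb, lower), upper)


def run_bandit(pull, bounds, horizon, seed=0):
    """Play `horizon` rounds; `pull(arm, rng)` returns a reward in [0, 1]."""
    rng = random.Random(seed)
    active = prune_arms(bounds)
    counts = [0] * len(bounds)
    means = [0.0] * len(bounds)
    for t in range(1, horizon + 1):
        untried = [k for k in active if counts[k] == 0]
        if untried:
            arm = untried[0]  # play every surviving arm once first
        else:
            arm = max(
                active,
                key=lambda k: clipped_ucb(means[k], counts[k], t, *bounds[k]),
            )
        reward = pull(arm, rng)
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]  # running average
    return counts, means


# Hypothetical example: three basis policies with made-up expected returns
# and made-up causal bounds (assumed to come from the expert data).
true_means = [0.2, 0.7, 0.55]
bounds = [(0.0, 0.3), (0.5, 0.9), (0.2, 0.6)]


def pull(arm, rng):
    """Bernoulli stand-in for rolling out a basis policy and scoring it."""
    return 1.0 if rng.random() < true_means[arm] else 0.0


counts, means = run_bandit(pull, bounds, horizon=500)
```

In this toy setting the first arm is never played, because its upper bound (0.3) lies below the lower bound of the second arm (0.5). Truncating the index also caps the optimism assigned to rarely played arms, which is one intuition for why the bounds can reduce exploration variance.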