Abstract
Recurrent meta reinforcement learning (meta-RL) agents employ a recurrent neural network (RNN) for the purpose of "learning a learning algorithm". After training on a pre-specified task distribution, the learned weights of the agent's RNN are said to implement an efficient learning algorithm through their activity dynamics, which allows the agent to quickly solve new tasks sampled from the same distribution. However, due to the black-box nature of these agents, the way in which they work is not yet fully understood. In this study, we shed light on the internal working mechanisms of these agents by reformulating the meta-RL problem using the Partially Observable Markov Decision Process (POMDP) framework. We hypothesize that the learned activity dynamics act as belief states for such agents. Several illustrative experiments support this hypothesis and suggest that recurrent meta-RL agents can be viewed as agents that learn to act optimally in partially observable environments consisting of multiple related tasks. This view helps in understanding their failure cases and some interesting model-based results reported in the literature.