Abstract
One of the key behavioral characteristics used in neuroscience to determine whether the subject of study, be it a rodent or a human, exhibits model-based learning is effective adaptation to local changes in the environment. In reinforcement learning, however, recent work has shown that modern deep model-based reinforcement-learning (MBRL) methods adapt poorly to such changes. An explanation for this mismatch is that MBRL methods are typically designed with sample-efficiency on a single task in mind, and the requirements for effective adaptation are substantially higher, both in terms of the learned world model and the planning routine. One particularly challenging requirement is that the learned world model has to be sufficiently accurate throughout relevant parts of the state-space. This is challenging for deep-learning-based world models due to catastrophic forgetting. And while a replay buffer can mitigate the effects of catastrophic forgetting, the traditional first-in-first-out replay buffer precludes effective adaptation because it maintains stale data. In this work, we show that a conceptually simple variation of this traditional replay buffer is able to overcome this limitation. By removing from the buffer only those samples that lie in the local neighbourhood of newly observed samples, deep world models can be built that maintain their accuracy across the state-space while also being able to adapt effectively to changes in the reward function. We demonstrate this by applying our replay-buffer variation to a deep version of the classical Dyna method, as well as to recent methods such as PlaNet and DreamerV2, showing that deep model-based methods can also adapt effectively to local changes in the environment.
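The idea of removing only samples near newly observed ones can be illustrated with a minimal sketch. It is not the implementation evaluated in this work. The class name `LocalForgettingBuffer`, the fixed `radius`, and the use of Euclidean distance on raw state vectors to define the "local neighbourhood" are all assumptions made for illustration.

```python
import math
import random


class LocalForgettingBuffer:
    """Hypothetical sketch of a replay buffer that forgets locally.

    Assumption: states are numeric vectors and "local neighbourhood"
    means Euclidean distance <= radius. The abstract does not specify
    how locality is measured.
    """

    def __init__(self, radius):
        self.radius = radius      # assumed neighbourhood size
        self.transitions = []     # list of (state, action, reward, next_state)

    def add(self, state, action, reward, next_state):
        state = tuple(state)
        # Drop only stored samples whose state lies near the new state;
        # samples from other parts of the state-space are kept.
        self.transitions = [
            t for t in self.transitions
            if math.dist(t[0], state) > self.radius
        ]
        self.transitions.append((state, action, reward, tuple(next_state)))

    def sample(self, batch_size, rng=random):
        # Uniform sampling with replacement from the retained transitions.
        return [rng.choice(self.transitions) for _ in range(batch_size)]

    def __len__(self):
        return len(self.transitions)


buf = LocalForgettingBuffer(radius=0.5)
buf.add((0.0, 0.0), 0, 0.0, (0.1, 0.0))   # distant region, left untouched
buf.add((5.0, 5.0), 1, 1.0, (5.1, 5.0))   # region whose reward later changes
buf.add((5.1, 5.0), 1, -1.0, (5.2, 5.0))  # new sample evicts its stale neighbour
rewards = sorted(t[2] for t in buf.transitions)
```

In this toy run, the stale sample with reward `1.0` is evicted once a nearby state is observed with a different reward, while the sample from the distant region is kept. A first-in-first-out buffer would instead evict by age alone. A fixed radius also does not bound the buffer size, which a practical version would have to handle.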