Abstract
Continuously estimating an agent's state space and a representation of its surroundings has proven vital towards full autonomy. A shared common ground among systems which successfully achieve this feat is the integration of previously encountered observations into the current state being estimated. This necessitates the use of a memory module for incorporating previously visited states whilst simultaneously offering an internal representation of the observed environment. In this work we develop a memory module which contains rigidly aligned point-embeddings that represent a coherent scene structure acquired from an RGB-D sequence of observations. The point-embeddings are extracted using modern convolutional neural network architectures, and alignment is performed by computing a dense correspondence matrix between a new observation and the current embeddings residing in the memory module. The whole framework is end-to-end trainable, resulting in a recurrent joint optimisation of the point-embeddings contained in the memory. This process amplifies the shared information across states, providing increased robustness and accuracy. We show significant improvement of our method across a set of experiments performed on the synthetic VIZDoom environment and the real-world Active Vision Dataset.
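
To make the alignment step more concrete, the following is a minimal NumPy sketch of how a dense correspondence matrix between observation and memory embeddings could drive a rigid alignment. It is an illustration under stated assumptions, not the method as implemented in this work: the softmax over dot-product similarities, the `temperature` parameter, the closed-form SVD (Kabsch) solver, and the function names `dense_correspondence` and `rigid_align` are all hypothetical choices made for this example.

```python
import numpy as np


def dense_correspondence(obs_emb, mem_emb, temperature=0.1):
    """Soft correspondence matrix of shape (N_obs, N_mem).

    Assumption: similarity is a dot product between embeddings and each
    row is normalised with a softmax, so every observed point gets a
    distribution over the points stored in memory.
    """
    sim = (obs_emb @ mem_emb.T) / temperature
    sim -= sim.max(axis=1, keepdims=True)  # subtract row max for numerical stability
    weights = np.exp(sim)
    return weights / weights.sum(axis=1, keepdims=True)


def rigid_align(src, dst):
    """Least-squares rotation R and translation t with dst ~= src @ R.T + t.

    Assumption: a closed-form Kabsch solution via SVD stands in for
    whatever solver the full framework uses.
    """
    src_c = src.mean(axis=0)
    dst_c = dst.mean(axis=0)
    # 3x3 cross-covariance between the centred point sets.
    H = (src - src_c).T @ (dst - dst_c)
    U, _, Vt = np.linalg.svd(H)
    # Flip the last axis if needed so that R is a proper rotation (det = +1).
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst_c - R @ src_c
    return R, t
```

Given observation points `obs_pts` with embeddings `obs_emb`, and memory points `mem_pts` with embeddings `mem_emb`, one would compute `C = dense_correspondence(obs_emb, mem_emb)`, form soft targets `C @ mem_pts`, and call `rigid_align(obs_pts, C @ mem_pts)` to obtain the pose that brings the new observation into the memory frame. Both steps are built from differentiable operations, which is consistent with the end-to-end training described above, although this NumPy version does not itself compute gradients.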