Abstract
We present Sequential Attend, Infer, Repeat (SQAIR), an interpretable deep generative model for videos of moving objects. It can reliably discover and track objects throughout the sequence of frames, and can also generate future frames conditioning on the current frame, thereby simulating expected motion of objects. This is achieved by explicitly encoding object presence, locations and appearances in the latent variables of the model. SQAIR retains all strengths of its predecessor, Attend, Infer, Repeat (AIR, Eslami et al., 2016), including learning in an unsupervised manner, and addresses its shortcomings. We use a moving multi-MNIST dataset to show limitations of AIR in detecting overlapping or partially occluded objects, and show how SQAIR overcomes them by leveraging temporal consistency of objects. Finally, we also apply SQAIR to real-world pedestrian CCTV data, where it learns to reliably detect, track and generate walking pedestrians with no supervision.