Embodied vision for learning object representations

Abstract

Recent time-contrastive learning approaches manage to learn invariant object representations without supervision. This is achieved by mapping successive views of an object onto close-by internal representations. When considering this learning approach as a model of the development of human object recognition, it is important to consider what visual input a toddler would typically observe while interacting with objects. First, human vision is highly foveated, with high resolution only available in the central region of the field of view. Second, objects may be seen against a blurry background due to infants' limited depth of field. Third, during object manipulation a toddler mostly observes close objects filling a large part of the field of view due to their rather short arms. Here, we study how these effects impact the quality of visual representations learnt through time-contrastive learning. To this end, we let a visually embodied agent "play" with objects in different locations of a near photo-realistic flat. During each play session the agent views an object in multiple orientations before turning its body to view another object. The resulting sequence of views feeds a time-contrastive learning algorithm. Our results show that visual statistics mimicking those of a toddler improve object recognition accuracy in both familiar and novel environments. We argue that this effect is caused by the reduction of features extracted in the background, a neural network bias for large features in the image and a greater similarity between novel and familiar background regions. We conclude that the embodied nature of visual learning may be crucial for understanding the development of human object perception.
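To make the idea of "mapping successive views onto close-by internal representations" concrete, the following is a minimal sketch of one common way to express such an objective, an InfoNCE-style contrastive loss. The abstract does not state which loss is used, so the function name `time_contrastive_loss`, the `temperature` parameter, and the use of other pairs in the batch as negatives are all assumptions made for illustration, not the authors' implementation.

```python
import numpy as np

def time_contrastive_loss(anchors, successors, temperature=0.1):
    """Illustrative InfoNCE-style loss (an assumption, not the paper's code).

    anchors:    array of shape (N, D), embedding of a view at time t.
    successors: array of shape (N, D), embedding of the view at time t+1
                for the same row; other rows act as negatives.
    """
    a = np.asarray(anchors, dtype=np.float64)
    s = np.asarray(successors, dtype=np.float64)
    # L2-normalise so that dot products are cosine similarities.
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    s = s / np.linalg.norm(s, axis=1, keepdims=True)
    # logits[i, j]: similarity between anchor i and successor j.
    logits = a @ s.T / temperature
    # Numerically stable log-softmax over each row.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # The matching successor for anchor i sits on the diagonal.
    return -np.mean(np.diag(log_prob))

# Usage: successive views that embed close together give a low loss,
# while mismatched pairs give a high one.
rng = np.random.default_rng(0)
views_t = rng.normal(size=(8, 16))
views_t1 = views_t + 0.01 * rng.normal(size=(8, 16))
loss_aligned = time_contrastive_loss(views_t, views_t1)
loss_mismatched = time_contrastive_loss(views_t, views_t1[::-1])
```

Minimising such a loss pulls the embeddings of temporally adjacent views together and pushes apart views from unrelated moments, which is the mechanism the abstract describes in words.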
