In this paper, we explore how to build upon the data and models of Internet images and adapt them to robot vision without requiring any extra labels. We present a framework called Self-supervised Embodied Active Learning (SEAL). It utilizes perception models trained on Internet images to learn an active exploration policy. The observations gathered by this exploration policy are labelled using 3D consistency and used to improve the perception model. We build and utilize 3D semantic maps to learn both action and perception in a completely self-supervised manner. The semantic map is used to compute an intrinsic motivation reward for training the exploration policy and to label the agent's observations using spatio-temporal 3D consistency and label propagation. We demonstrate that the SEAL framework can be used to close the action-perception loop: it improves the object detection and instance segmentation performance of a pretrained perception model just by moving around in training environments, and the improved perception model can in turn be used to improve Object Goal Navigation.
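
To make the role of the semantic map concrete, the following is a minimal sketch of how per-frame predictions might be fused into a voxel map and then propagated back as pseudo-labels. It is an illustration under stated assumptions, not the SEAL implementation: the voxel-tuple representation, the score-weighted vote, the `background` label, and the function names `build_semantic_map`, `propagate_labels`, and `intrinsic_reward` are all hypothetical, and the reward shown (a count of voxels labelled with an object category) is only one plausible choice of intrinsic motivation.

```python
from collections import defaultdict

# Hypothetical data layout: each frame is a list of (voxel, label, score)
# tuples, where voxel is an (x, y, z) index into a shared 3D grid.

def build_semantic_map(frames):
    """Accumulate per-voxel, per-label confidence across all frames."""
    votes = defaultdict(lambda: defaultdict(float))
    for frame in frames:
        for voxel, label, score in frame:
            votes[voxel][label] += score
    # Keep the label with the highest accumulated score in each voxel.
    return {voxel: max(scores, key=scores.get) for voxel, scores in votes.items()}

def propagate_labels(frames, semantic_map):
    """Relabel every observed point with its voxel's consensus label."""
    return [
        [(voxel, semantic_map[voxel]) for voxel, _, _ in frame]
        for frame in frames
    ]

def intrinsic_reward(semantic_map, background="background"):
    """Assumed reward: number of voxels labelled with an object category."""
    return sum(1 for label in semantic_map.values() if label != background)

# Three views of the same two voxels; the third view mislabels the chair.
frames = [
    [((0, 0, 0), "chair", 0.9), ((1, 0, 0), "background", 0.8)],
    [((0, 0, 0), "chair", 0.8), ((1, 0, 0), "background", 0.7)],
    [((0, 0, 0), "couch", 0.6), ((1, 0, 0), "background", 0.9)],
]

semantic_map = build_semantic_map(frames)
pseudo_labels = propagate_labels(frames, semantic_map)
reward = intrinsic_reward(semantic_map)
```

In this toy example the inconsistent "couch" prediction in the third view is overridden by the consensus across views, so the relabelled frames could serve as training targets for the perception model without any human annotation.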