Abstract
In this paper, we present a framework for reading analog clocks in natural images or videos. Specifically, we make the following contributions: First, we create a scalable pipeline for generating synthetic clocks, significantly reducing the need for labour-intensive annotations; Second, we introduce a clock recognition architecture based on spatial transformer networks (STN), which is trained end-to-end for clock alignment and recognition. We show that the model trained on the proposed synthetic dataset generalises to real clocks with good accuracy, supporting a Sim2Real training regime; Third, to further reduce the gap between simulation and real data, we leverage a special property of time, namely its uniformity, to generate reliable pseudo-labels on real unlabelled clock videos, and show that training on these videos offers further improvements while still requiring zero manual annotations. Lastly, we introduce three benchmark datasets based on COCO, Open Images, and The Clock movie, totalling 4,472 images with clocks, with full annotations for time, accurate to the minute.
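To make the uniformity idea concrete, the following is a minimal sketch of how pseudo-labels might be filtered on an unlabelled video; it is an illustration under stated assumptions, not the implementation used in this paper. It assumes per-frame predictions expressed in minutes on a 12-hour dial and known frame timestamps in seconds. The function names, the median-based reference, and the one-minute tolerance are all hypothetical choices.

```python
from statistics import median
from typing import List, Optional

MINUTES_ON_DIAL = 12 * 60  # an analog dial repeats every 720 minutes


def circular_diff(a: float, b: float) -> float:
    """Signed difference a - b on the 720-minute dial, in [-360, 360)."""
    half = MINUTES_ON_DIAL / 2
    return (a - b + half) % MINUTES_ON_DIAL - half


def uniform_pseudo_labels(
    frame_times_s: List[float],
    predicted_minutes: List[float],
    tolerance_min: float = 1.0,  # hypothetical threshold, not from the paper
) -> Optional[List[int]]:
    """Return per-frame pseudo-labels if the predictions advance uniformly
    with elapsed video time; return None to reject the video otherwise."""
    if not frame_times_s or len(frame_times_s) != len(predicted_minutes):
        return None

    # If time flows uniformly, subtracting the elapsed time from each
    # prediction should give the same start time for every frame.
    t0 = frame_times_s[0]
    elapsed_min = [(t - t0) / 60.0 for t in frame_times_s]
    implied_starts = [
        (p - e) % MINUTES_ON_DIAL for p, e in zip(predicted_minutes, elapsed_min)
    ]

    # Robust reference start time: median offset relative to the first frame.
    offsets = [circular_diff(s, implied_starts[0]) for s in implied_starts]
    reference = (implied_starts[0] + median(offsets)) % MINUTES_ON_DIAL

    # Reject the video if any frame disagrees with the reference start time.
    if any(abs(circular_diff(s, reference)) > tolerance_min for s in implied_starts):
        return None

    # Consistent video: re-derive each label from the reference start time.
    return [
        int(round((reference + e) % MINUTES_ON_DIAL)) % MINUTES_ON_DIAL
        for e in elapsed_min
    ]
```

For example, frames sampled at 0 s, 300 s, and 600 s with predictions of 718, 3, and 8 minutes are consistent across the 12 o'clock wrap-around and would be kept, whereas a video whose predictions jump by hours between adjacent frames would be discarded.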