This paper proposes Localized Narratives, a new form of multimodal image annotations connecting vision and language. We ask annotators to describe an image with their voice while simultaneously hovering their mouse over the region they are describing. Since the voice and the mouse pointer are synchronized, we can localize every single word in the description. This dense visual grounding takes the form of a mouse trace segment per word and is unique to our data. We annotate 628k images with Localized Narratives: the whole COCO dataset and 504k images of the Open Images dataset, which we make publicly available. We provide an extensive analysis of these annotations showing they are diverse, accurate, and efficient to produce. We also demonstrate their utility on the application of controlled image captioning.
1 Dataset Collection, Quality, and Statistics
1.1 Dataset Collection
Image Sources and Scale. We annotated images from COCO \cococite and Open Images \oidcite. To facilitate comparison with previous work, we re-annotated the full set of \numimagescoco COCO images (train and validation). For Open Images, we annotated \numimagesoid images, selected from the train split. To enable cross-modal applications, we selected images for which object segmentations [benenson19cvpr], bounding boxes, or visual relationships [kuznetsova18arxiv] are already available.
Overall, we annotated \numimages images (Tab. LABEL:tab:datasets). For analysis purposes, we annotated 5000 COCO images with replication 5 (5 different annotators annotated each image). Beyond this, we prioritized covering a larger set of images, so the rest of the images were annotated with replication 1. All analyses in the remainder of this section are done on the COCO dataset.
Annotation Cost. Annotating one image with Localized Narratives takes \annotationtime seconds on average. We consider this a relatively low cost given the amount of information harvested, and it allows data collection at scale (\numcaptions annotations so far). Manual transcription takes up the majority of the time (\transcriptiontime seconds, \transcriptiontimepercent%), while the narration step takes only \narrationtime seconds (\narrationtimepercent%). In the future, as ASR systems improve further, manual transcription could be skipped and Localized Narratives could become even faster thanks to our core idea of using speech.
To put our timings into perspective, we can roughly compare to Flickr30k Entities [plummer17ijcv], which is the only work we are aware of that reports annotation times. Its annotators first manually identified which words constitute entities, which took 235 seconds per image. In a second stage, annotators drew bounding boxes for these selected entities, taking 408 seconds per image (8.7 entities per image on average). This yields a total of 235 + 408 = 643 seconds per image, without counting the time to write the actual captions (not reported). This is slower than the total annotation cost of our method, which includes both the grounding of 12.1 nouns per image and the writing of the caption. The Visual Genome [krishna17ijcv] dataset was also annotated by a complex multi-stage pipeline, which likewise involved drawing a bounding box for each phrase describing a region in the image.
1.2 Dataset Quality
To ensure high quality, Localized Narratives was produced by 126 professional annotators working full time on this project. Annotator managers did frequent manual inspections to keep quality consistently high. In addition, we used an automatic quality control mechanism to ensure that the spoken and written transcriptions match (Sec. LABEL:sec:generation, Automatic quality control). In practice, we set a high quality bar, which resulted in discarding 23.5% of all annotations (all dataset statistics reported in this paper are computed after this step). Below we analyze the quality of the annotations that remained after this automatic discarding step.
Semantic and Transcription Accuracy. In this section we quantify (i) how well the noun phrases and verbs in the caption represent the objects and actions in the image (semantic accuracy) and (ii) how well the manually transcribed caption matches the voice recording (transcription accuracy). We manually check every word in 100 randomly selected Localized Narratives annotations and log each of these two types of errors. This check was performed carefully by experts (i.e., the authors of this paper), not by the annotators themselves, and hence is an independent source of verification.
In terms of semantic accuracy, we check every noun and verb in a caption and assess whether that object or action is indeed present in the corresponding image. We allow generality up to a base class name (e.g., we count either “dog” or “Chihuahua” as correct for a Chihuahua in the image), and we strictly enforce correctness (e.g., we count “skating” as incorrect when the correct term is “snowboarding”, or “bottle” when the object is a “jar”). Under these criteria, semantic accuracy is very high: 98.0% of the nouns and verbs are accurate.
In terms of transcription accuracy, we listen to the voice recordings and compare them to the manual transcriptions. We count every instance of (i) a missing word in the transcription, (ii) an extra word in the transcription, and (iii) a word with typographical errors. We normalize the number of words with errors by the total number of words in the 100 captions. This results in error rates of 3.8% for type (i), 1.5% for type (ii), and 1.9% for type (iii), showing that transcription accuracy is high.
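The three error types and their normalization can be illustrated with a minimal sketch. In the paper the counting was done manually by listening to the recordings; the version below is only a hypothetical automation that assumes a reference word sequence for the spoken narration is available, aligns it to the written transcription with Python's standard `difflib.SequenceMatcher`, and treats one-to-one word substitutions as typographical errors. The function name `transcription_error_rates` and the choice of the transcribed word count as the denominator are assumptions made for illustration.

```python
from difflib import SequenceMatcher


def transcription_error_rates(pairs):
    """Estimate missing / extra / typo word rates.

    pairs: list of (spoken_words, transcribed_words), each a list of strings.
    Rates are normalized by the total number of transcribed words
    (an assumption; the text says "total number of words in the captions").
    """
    missing = extra = typo = total = 0
    for spoken, written in pairs:
        total += len(written)
        matcher = SequenceMatcher(a=spoken, b=written, autojunk=False)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            n_spoken, n_written = i2 - i1, j2 - j1
            if tag == "delete":      # spoken but not transcribed
                missing += n_spoken
            elif tag == "insert":    # transcribed but never spoken
                extra += n_written
            elif tag == "replace":   # substitutions counted as typos (assumption)
                paired = min(n_spoken, n_written)
                typo += paired
                missing += n_spoken - paired
                extra += n_written - paired
    return {
        "missing": missing / total,
        "extra": extra / total,
        "typo": typo / total,
    }


if __name__ == "__main__":
    spoken = "a dog sits on the grass".split()
    written = "a dgo sits the green grass".split()
    print(transcription_error_rates([(spoken, written)]))
```

A word-level alignment like this cannot tell a genuine typo from a wrong word, so it would at best approximate the manual check described above.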
Localization Accuracy. To analyze how well the mouse traces match the location of actual objects in the image, we extract all instances of any of the 80 COCO object classes in our captions (exact string matching). We recover 146,723 instances. We then associate each mouse trace segment with the closest ground-truth box of its corresponding class. Figure 2 displays the 2D histogram of the positions of all trace segment points with respect to the closest box, normalized by box size. We observe that most of the trace points are within the correct bounding box.
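The normalization described above can be sketched as follows. This is an illustrative reading, not the authors' implementation: the text does not define "closest", so the sketch assumes the box with the smallest mean point-to-box distance, and it assumes boxes are given as `(x0, y0, x1, y1)` corner coordinates. All function names are hypothetical.

```python
import math
from collections import Counter


def distance_to_box(point, box):
    """Euclidean distance from a point to a box (0 if the point is inside)."""
    x, y = point
    x0, y0, x1, y1 = box
    dx = max(x0 - x, 0.0, x - x1)
    dy = max(y0 - y, 0.0, y - y1)
    return math.hypot(dx, dy)


def closest_box(trace, boxes):
    """Box of the right class with the smallest mean distance to the trace (assumption)."""
    return min(boxes, key=lambda b: sum(distance_to_box(p, b) for p in trace) / len(trace))


def normalize_to_box(point, box):
    """Map a point to box-relative coordinates: the box spans [0, 1] x [0, 1]."""
    x, y = point
    x0, y0, x1, y1 = box
    return ((x - x0) / (x1 - x0), (y - y0) / (y1 - y0))


def localize(trace, boxes):
    """Return box-normalized trace points and the fraction that fall inside the box."""
    box = closest_box(trace, boxes)
    normalized = [normalize_to_box(p, box) for p in trace]
    inside = sum(1 for u, v in normalized if 0.0 <= u <= 1.0 and 0.0 <= v <= 1.0)
    return normalized, inside / len(trace)


def histogram_2d(points, bin_size=0.25):
    """Count normalized points per square bin of side bin_size."""
    return Counter((math.floor(u / bin_size), math.floor(v / bin_size)) for u, v in points)


if __name__ == "__main__":
    trace = [(15, 15), (20, 20), (35, 20)]  # one trace segment, in pixels
    boxes = [(10, 10, 30, 30), (100, 100, 120, 120)]  # boxes of the matched class
    points, frac = localize(trace, boxes)
    print(points, frac, histogram_2d(points))
```

Accumulating `histogram_2d` over the normalized points of all matched trace segments would give a plot in the spirit of the 2D histogram in the text, with the unit square corresponding to the ground-truth box.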
We attribute the trace points that fall outside the box to two different effects. First, annotators commonly circle around objects (Fig. LABEL:fig:loc_narr_intro_example and Fig. 3). This causes the mouse traces to be close to the box, but not inside it. Second, annotators sometimes start moving the mouse before they describe the object, or vice versa. We see both cases as a research opportunity to better understand the connection between vision and language.
1.3 Dataset Statistics
Richness. The mean length of the captions we produced is \captionlength words (Tab. LABEL:tab:datasets), substantially longer than in previous captioning datasets (e.g., 4 times longer than the individual COCO captions). We also compare in terms of the average number of nouns, pronouns, adjectives, verbs, and adpositions (prepositions and postpositions, Tab. 1). We determined this using the spaCy [spacy] part-of-speech tagger. Localized Narratives has a higher occurrence per caption for each of these categories compared to previous datasets, which indicates that our annotations provide richer use of natural language in connection to the images they describe.
|COCO Captions [chen15arxiv]||10.5||3.6||0.2||0.8||1.7||0.9|
|Loc. Narratives (Ours)||\captionlength||12.1||3.8||2.0||5.3||4.2|
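The per-caption averages discussed above could be computed along the following lines. This is a minimal sketch under stated assumptions: captions are already tagged with Universal POS tags (the tag set exposed by spaCy's `token.pos_`), only the five tags listed are counted, and proper nouns (`PROPN`) are not merged into nouns, since the text does not say how they were handled. The function name `mean_pos_counts` is hypothetical.

```python
from collections import Counter

# Universal POS tags for nouns, pronouns, adjectives, verbs, and adpositions.
CATEGORIES = ("NOUN", "PRON", "ADJ", "VERB", "ADP")


def mean_pos_counts(tagged_captions):
    """Average number of words per caption for each category.

    tagged_captions: list of captions, each a list of (word, pos_tag) pairs.
    """
    totals = Counter()
    for caption in tagged_captions:
        for _word, tag in caption:
            if tag in CATEGORIES:
                totals[tag] += 1
    n_captions = len(tagged_captions)
    return {tag: totals[tag] / n_captions for tag in CATEGORIES}


if __name__ == "__main__":
    captions = [
        [("a", "DET"), ("dog", "NOUN"), ("sits", "VERB"), ("on", "ADP"), ("grass", "NOUN")],
        [("it", "PRON"), ("is", "AUX"), ("brown", "ADJ")],
    ]
    print(mean_pos_counts(captions))
```

Keeping the tagger outside the function makes the counting step easy to check by hand and independent of any particular tagging model.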
Diversity. To illustrate the diversity of our captions, we plot the distribution of the number of nouns per caption, and compare it to the distributions obtained over previous datasets (Fig. 2). We observe that the range of the number of nouns is significantly wider in Localized Narratives (up to 45 nouns in some images). This poses an additional challenge for captioning methods: automatically adapting the length of the descriptions to each image, as a function of the richness of its content. Beyond nouns, Localized Narratives provide visual grounding for every word (verbs, prepositions, etc.). This is especially interesting for relationship words, e.g., “woman holding ballon” (Fig. LABEL:fig:loc_narr_intro_example) or “with a hand under his chin” (Fig. LABEL:fig:sample_datasets(d)). This opens the door to a new avenue of research: understanding how humans naturally ground visual relationships.
Diversity in Localized Narratives is present not only in the language modality, but also in the visual modality, for example in the different ways annotators indicate the spatial location of objects in an image. In contrast to previous works, where the grounding takes the form of a bounding box, our instructions let annotators hover the mouse over the object in any way they feel is natural. This leads to diverse styles of creating trace segments (Fig. 3): circling around an object (sometimes without even intersecting it), scribbling over it, underlining in the case of text, etc. This diversity also presents another challenge: detecting and adapting to different trace styles in order to make full use of them.
[Figure panel labels: “Ship”; “Open land with some grass on it”; “Main stairs”]