What's in a Caption? Dataset-Specific Linguistic Diversity and Its Effect on Visual Description Models and Metrics

Abstract

While there have been significant gains in the field of automated video description, the generalization performance of automated description models to novel domains remains a major barrier to using these systems in the real world. Most visual description methods are known to capture and exploit patterns in the training data leading to evaluation metric increases, but what are those patterns? In this work, we examine several popular visual description datasets, and capture, analyze, and understand the dataset-specific linguistic patterns that models exploit but that do not generalize to new domains. At the token level, sample level, and dataset level, we find that caption diversity is a major driving factor behind the generation of generic and uninformative captions. We further show that state-of-the-art models even outperform held-out ground truth captions on modern metrics, and that this effect is an artifact of linguistic diversity in datasets. Understanding this linguistic diversity is key to building strong captioning models; we recommend several methods and approaches for maintaining diversity in the collection of new data, and for dealing with the consequences of limited diversity when using current models and metrics.
