What Makes for Good Image Captions?

  • 2025-08-20 17:41:38
  • Delong Chen, Samuel Cahyawijaya, Etsuko Ishii, Ho Shu Chan, Yejin Bang, Pascale Fung

Abstract

This paper establishes a formal information-theoretic framework for image captioning, conceptualizing captions as compressed linguistic representations that selectively encode semantic units in images. Our framework posits that good image captions should balance three key aspects: informationally sufficient, minimally redundant, and readily comprehensible by humans. By formulating these aspects as quantitative measures with adjustable weights, our framework provides a flexible foundation for analyzing and optimizing image captioning systems across diverse task requirements. To demonstrate its applicability, we introduce the Pyramid of Captions (PoCa) method, which generates enriched captions by integrating local and global visual information. We present both theoretical proof that PoCa improves caption quality under certain assumptions, and empirical validation of its effectiveness across various image captioning models and datasets.

 
