Abstract
While densely annotated image captions significantly facilitate the learning of robust vision-language alignment, methodologies for systematically optimizing human annotation efforts remain underexplored. We introduce Chain-of-Talkers (CoTalk), an AI-in-the-loop methodology designed to maximize the number of annotated samples and improve their comprehensiveness under fixed budget constraints (e.g., total human annotation time). The framework is built upon two key insights. First, sequential annotation reduces redundant workload compared to conventional parallel annotation, as subsequent annotators only need to annotate the ``residual'', i.e., the missing visual information that previous annotations have not covered. Second, humans take in textual input faster by reading and produce annotations with much higher throughput by talking; a multimodal interface combining the two therefore improves efficiency. We evaluate our framework from two aspects: intrinsic evaluation assesses the comprehensiveness of semantic units, obtained by parsing detailed captions into object-attribute trees and analyzing their effective connections; extrinsic evaluation measures the practical usefulness of the annotated captions in facilitating vision-language alignment. Experiments with eight participants show that CoTalk improves annotation speed (0.42 vs. 0.30 units/sec) and retrieval performance (41.13\% vs. 40.52\%) over the parallel method.