RxnCaption: Reformulating Reaction Diagram Parsing as Visual Prompt Guided Captioning

Abstract

Large-scale chemical reaction datasets are crucial for AI research in chemistry. However, existing chemical reaction data often exist as images within papers, making them not machine-readable and unusable for training machine learning models. In response to this challenge, we propose the RxnCaption framework for the task of chemical Reaction Diagram Parsing (RxnDP). Our framework reformulates the traditional coordinate-prediction-driven parsing process into an image captioning problem, which Large Vision-Language Models (LVLMs) handle naturally. We introduce a strategy termed "BBox and Index as Visual Prompt" (BIVP), which uses our state-of-the-art molecular detector, MolYOLO, to pre-draw molecular bounding boxes and indices directly onto the input image. This turns the downstream parsing into a natural-language description problem. Extensive experiments show that the BIVP strategy significantly improves structural extraction quality while simplifying model design. We further construct the RxnCaption-11k dataset, an order of magnitude larger than prior real-world literature benchmarks, with a balanced test subset across four layout archetypes. Experiments demonstrate that RxnCaption-VL achieves state-of-the-art performance on multiple metrics. We believe our method, dataset, and models will advance structured information extraction from chemical literature and catalyze broader AI applications in chemistry. We will release data, models, and code on GitHub.
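
To make the BIVP idea concrete, the following is a minimal sketch of the bookkeeping it implies: detected molecule boxes receive indices, the model describes each reaction in terms of those indices, and the indices are then mapped back to boxes. Everything here is an illustrative assumption rather than the released RxnCaption code: the `Box` type, the reading-order indexing in `index_boxes`, the JSON caption schema with `reactants`, `conditions`, and `products` keys, and the helper `resolve_caption` are all hypothetical. Rendering the boxes and index labels onto the image is omitted.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box in pixel coordinates (hypothetical type)."""
    x1: float
    y1: float
    x2: float
    y2: float


def index_boxes(boxes):
    """Assign 1-based indices to boxes.

    Assumption: indices follow top-to-bottom, then left-to-right order
    of the box's top-left corner. The paper's actual convention may differ.
    """
    ordered = sorted(boxes, key=lambda b: (b.y1, b.x1))
    return {i: b for i, b in enumerate(ordered, start=1)}


def resolve_caption(caption_json, indexed):
    """Map an index-based caption back to bounding boxes.

    Assumption: the caption is a JSON list of reactions, each holding
    lists of molecule indices under the three role keys below.
    """
    roles = ("reactants", "conditions", "products")
    resolved = []
    for rxn in json.loads(caption_json):
        resolved.append({role: [indexed[i] for i in rxn.get(role, [])]
                         for role in roles})
    return resolved


# Three detected molecules on one row, given in arbitrary detector order.
detections = [Box(200, 10, 260, 60), Box(10, 10, 80, 60), Box(100, 10, 170, 60)]
indexed = index_boxes(detections)

# A caption as the model might produce it: only indices, no coordinates.
caption = '[{"reactants": [1, 2], "conditions": [], "products": [3]}]'
reactions = resolve_caption(caption, indexed)
print(reactions[0]["products"])
```

The point of the sketch is that the language model never emits coordinates: it only names the indices that are visible in the image, and localization is recovered by a dictionary lookup.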
