RealKIE: Five Novel Datasets for Enterprise Key Information Extraction

Abstract

We introduce RealKIE, a benchmark of five challenging datasets aimed at advancing key information extraction methods, with an emphasis on enterprise applications. The datasets include a diverse range of documents including SEC S1 Filings, US Non-disclosure Agreements, UK Charity Reports, FCC Invoices, and Resource Contracts. Each presents unique challenges: poor text serialization, sparse annotations in long documents, and complex tabular layouts. These datasets provide a realistic testing ground for key information extraction tasks like investment analysis and contract analysis. In addition to presenting these datasets, we offer an in-depth description of the annotation process, document processing techniques, and baseline modeling approaches. This contribution facilitates the development of NLP models capable of handling practical challenges and supports further research into information extraction technologies applicable to industry-specific problems. The annotated data, OCR outputs, and code to reproduce baselines are available to download at https://indicodatasolutions.github.io/RealKIE/.
