Data-Efficient Information Extraction from Form-Like Documents

  • 2022-01-07 19:16:49
  • Beliz Gunel, Navneet Potti, Sandeep Tata, James B. Wendt, Marc Najork, Jing Xie
  • 12

Abstract

Automating information extraction from form-like documents at scale is a pressing need due to its potential impact on automating business workflows across many industries like financial services, insurance, and healthcare. The key challenge is that form-like documents in these business workflows can be laid out in virtually infinitely many ways; hence, a good solution to this problem should generalize to documents with unseen layouts and languages. A solution to this problem requires a holistic understanding of both the textual segments and the visual cues within a document, which is non-trivial. While the natural language processing and computer vision communities are starting to tackle this problem, there has not been much focus on (1) data-efficiency, and (2) ability to generalize across different document types and languages. In this paper, we show that when we have only a small number of labeled documents for training (~50), a straightforward transfer learning approach from a considerably structurally-different larger labeled corpus yields up to a 27 F1 point improvement over simply training on the small corpus in the target domain. We improve on this with a simple multi-domain transfer learning approach, that is currently in production use, and show that this yields up to a further 8 F1 point improvement. We make the case that data efficiency is critical to enable information extraction systems to scale to handle hundreds of different document-types, and learning good representations is critical to accomplishing this.

 
