PubTables-1M: Towards a universal dataset and metrics for training and evaluating table extraction models

Abstract

Recently, interest has grown in applying machine learning to the problem of table structure inference and extraction from unstructured documents. However, progress in this area has been challenging both to make and to measure, due to several issues that arise in training and evaluating models from labeled data. This includes challenges as fundamental as the lack of a single definitive ground truth output for each input sample and the lack of an ideal metric for measuring partial correctness for this task. To address these issues we propose a new dataset, PubMed Tables One Million (PubTables-1M), and a new class of metric, grid table similarity (GriTS). PubTables-1M is nearly twice as large as the previous largest comparable dataset, contains highly-detailed structure annotations, and can be used for models across multiple architectures and modalities. Further, it addresses issues such as ambiguity and lack of consistency in the annotations via a novel canonicalization and quality control procedure. We apply DETR to table extraction for the first time and show that object detection models trained on PubTables-1M produce excellent results out-of-the-box for all three tasks of detection, structure recognition, and functional analysis. It is our hope that PubTables-1M and GriTS can further progress in this area by creating data and metrics suitable for training and evaluating a wide variety of models for table extraction. Data and code will be released at https://github.com/microsoft/table-transformer.
