Synthetic Document Generator for Annotation-free Layout Recognition

Abstract

Analyzing the layout of a document to identify headers, sections, tables, figures, etc. is critical to understanding its content. Deep-learning-based approaches for detecting the layout structure of document images have been promising. However, these methods require a large number of annotated examples during training, which are both expensive and time-consuming to obtain. We describe here a synthetic document generator that automatically produces realistic documents with labels for the spatial positions, extents, and categories of the layout elements. The proposed generative process treats every physical component of a document as a random variable and models their intrinsic dependencies using a Bayesian network graph. Our hierarchical formulation using stochastic templates allows parameter sharing between documents to retain broad themes, while its distributional characteristics produce visually unique samples, thereby capturing complex and diverse layouts. We empirically illustrate that a deep layout detection model trained purely on the synthetic documents can match the performance of a model that uses real documents.
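The hierarchical idea can be illustrated with a minimal sketch, which is not the generator described in the paper. It assumes a toy two-level model: template-level variables (column count, margin, figure probability) are sampled once and shared by every document drawn from that template, and each document then samples its own elements conditioned on them. All names (`Element`, `sample_template`, `sample_document`), the three categories, and every numeric range are hypothetical choices made for illustration.

```python
import random
from dataclasses import dataclass


@dataclass
class Element:
    """One labeled layout element in normalized page coordinates (0..1)."""
    category: str
    x: float
    y: float
    width: float
    height: float


def sample_template(rng):
    """Sample template-level variables shared by all documents of one theme.

    The variables and their ranges are assumptions, not values from the paper.
    """
    return {
        "columns": rng.choice([1, 2]),
        "margin": rng.uniform(0.05, 0.12),
        "figure_prob": rng.uniform(0.1, 0.5),
    }


def sample_document(template, rng, gutter=0.03):
    """Sample one page by ancestral sampling: each element depends on the
    template variables and on the vertical space already used in its column."""
    margin = template["margin"]
    columns = template["columns"]
    col_w = (1.0 - 2 * margin - (columns - 1) * gutter) / columns
    elements = []
    for c in range(columns):
        x = margin + c * (col_w + gutter)
        y = margin
        while True:
            # Category is conditioned on the template's figure probability.
            r = rng.random()
            if r < template["figure_prob"]:
                category, h = "figure", rng.uniform(0.15, 0.30)
            elif r < template["figure_prob"] + 0.2:
                category, h = "header", rng.uniform(0.03, 0.05)
            else:
                category, h = "paragraph", rng.uniform(0.08, 0.20)
            if y + h > 1.0 - margin:
                break  # column is full
            elements.append(Element(category, x, y, col_w, h))
            y += h + rng.uniform(0.01, 0.03)  # random vertical spacing
    return elements


rng = random.Random(0)
template = sample_template(rng)
docs = [sample_document(template, rng) for _ in range(3)]
```

Because the bounding box and category of every element are known at sampling time, the labels come for free. Documents from the same template share margins and column structure but differ in the sequence and size of their elements.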
