Abstract
Multi-modal document pre-trained models have proven to be very effective in a variety of visually-rich document understanding (VrDU) tasks. Though existing document pre-trained models have achieved excellent performance on standard benchmarks for VrDU, the way they model and exploit the interactions between vision and language on documents has limited their generalization ability and accuracy. In this work, we investigate the problem of vision-language joint representation learning for VrDU, mainly from the perspective of supervisory signals. Specifically, we propose a pre-training paradigm called Bi-VLDoc, in which a bidirectional vision-language supervision strategy and a vision-language hybrid-attention mechanism are devised to fully explore and utilize the interactions between these two modalities, and thereby learn stronger cross-modal document representations with richer semantics. Benefiting from the learned informative cross-modal document representations, Bi-VLDoc significantly advances the state-of-the-art performance on three widely-used document understanding benchmarks: Form Understanding (from 85.14% to 93.44%), Receipt Information Extraction (from 96.01% to 97.84%), and Document Classification (from 96.08% to 97.12%). On Document Visual QA, Bi-VLDoc achieves state-of-the-art performance compared to previous single-model methods.