Donut: Document Understanding Transformer without OCR

Abstract

Understanding document images (e.g., invoices) has been an important research topic and has many applications in document processing automation. Through the latest advances in deep learning-based Optical Character Recognition (OCR), current Visual Document Understanding (VDU) systems have come to be designed based on OCR. Although such OCR-based approaches promise reasonable performance, they suffer from critical problems induced by the OCR, e.g., (1) expensive computational costs and (2) performance degradation due to OCR error propagation. In this paper, we propose a novel VDU model that is end-to-end trainable without an underlying OCR framework. To this end, we propose a new task and a synthetic document image generator to pre-train the model and mitigate the dependency on large-scale real document images. Our approach achieves state-of-the-art performance on various document understanding tasks in public benchmark datasets and private industrial service datasets. Through extensive experiments and analysis, we demonstrate the effectiveness of the proposed model, especially with consideration for real-world applications.
