Brno Mobile OCR Dataset

Abstract

We introduce the Brno Mobile OCR Dataset (B-MOD) for document Optical Character Recognition from low-quality images captured by handheld mobile devices. While OCR of high-quality scanned documents is a mature field where many commercial tools are available, and large datasets of text in the wild exist, no existing datasets can be used to develop and test document OCR methods robust to non-uniform lighting, image blur, strong noise, built-in denoising, sharpening, compression and other artifacts present in many photographs from mobile devices. This dataset contains 2 113 unique pages from random scientific papers, which were photographed by multiple people using 23 different mobile devices. The resulting 19 728 photographs of various visual quality are accompanied by precise positions and text annotations of 500k text lines. We further provide an evaluation methodology, including an evaluation server and a test set with non-public annotations. We provide a state-of-the-art text recognition baseline built on convolutional and recurrent neural networks trained with Connectionist Temporal Classification loss. This baseline achieves 2 %, 22 % and 73 % word error rates on easy, medium and hard parts of the dataset, respectively, confirming that the dataset is challenging. The presented dataset will enable future development and evaluation of document analysis for low-quality images. It is primarily intended for line-level text recognition, and can be further used for line localization, layout analysis, image restoration and text binarization.
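
The baseline results above are reported as word error rates. As a rough illustration of what that metric measures, the following minimal sketch computes a word error rate for a single text line as the word-level edit distance divided by the number of reference words. The function name `word_error_rate`, the whitespace tokenization, and the per-line normalization are assumptions made for this example; the dataset's evaluation server may tokenize and aggregate differently.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Assumption: words are separated by whitespace. This is an illustrative
    definition, not necessarily the one used by the B-MOD evaluation server.
    """
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing in hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr

    return prev[-1] / len(ref_words)


if __name__ == "__main__":
    # One substituted word out of four reference words.
    print(word_error_rate("the quick brown fox", "the quick brown dog"))
```

Because insertions count as errors, this ratio can exceed 1 when the hypothesis contains many spurious words.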
