DriveThru: a Document Extraction Platform and Benchmark Datasets for Indonesian Local Language Archives

Abstract

Indonesia is one of the most linguistically diverse countries. Despite this diversity, Indonesian languages remain underrepresented in Natural Language Processing (NLP) research and technologies. In the past two years, several efforts have been made to construct NLP resources for Indonesian languages. However, most of these efforts have focused on creating resources manually and are thus difficult to scale to more languages. Although many Indonesian languages do not have a web presence, there are local resources that document these languages well in printed forms such as books, magazines, and newspapers. Digitizing these existing resources will enable Indonesian language resource construction to scale to many more languages. In this paper, we propose an alternative method of creating datasets by digitizing documents, which have not previously been used to build digital language resources in Indonesia. DriveThru is a platform that extracts document content using Optical Character Recognition (OCR), so that language resources can be built with less manual effort and cost. This paper also studies the utility of current state-of-the-art large language models (LLMs) for post-OCR correction, showing that they can increase the character accuracy rate (CAR) and word accuracy rate (WAR) compared to off-the-shelf OCR.
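
The abstract does not define CAR and WAR. As a minimal sketch, assuming the common convention that each is one minus the Levenshtein edit distance normalized by the reference length (over characters for CAR, over whitespace-separated words for WAR), the two metrics could be computed as follows. The function names `edit_distance`, `accuracy_rate`, `car`, and `war` are hypothetical and are not taken from the paper or from DriveThru, and the paper's exact definitions may differ.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or token lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def accuracy_rate(ref, hyp):
    """Assumed definition: 1 - edit_distance / len(ref), floored at 0."""
    if len(ref) == 0:
        raise ValueError("reference must not be empty")
    return max(0.0, 1.0 - edit_distance(ref, hyp) / len(ref))


def car(reference, ocr_output):
    """Character accuracy rate, comparing the strings character by character."""
    return accuracy_rate(reference, ocr_output)


def war(reference, ocr_output):
    """Word accuracy rate, comparing whitespace-separated tokens."""
    return accuracy_rate(reference.split(), ocr_output.split())
```

Under these assumed definitions, a made-up reference `"selamat pagi"` and OCR output `"se1amat pagi"` differ by one character substitution, so WAR is 0.5 (one of two words is wrong) while CAR stays above 0.9. This illustrates why the two rates are usually reported together: a single character error costs an entire word under WAR.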
