OCR Post Correction for Endangered Language Texts

Abstract

There is little to no data available to build natural language processing models for most endangered languages. However, textual data in these languages often exists in formats that are not machine-readable, such as paper books and scanned images. In this work, we address the task of extracting text from these resources. We create a benchmark dataset of transcriptions for scanned books in three critically endangered languages and present a systematic analysis of how general-purpose OCR tools are not robust to the data-scarce setting of endangered languages. We develop an OCR post-correction method tailored to ease training in this data-scarce setting, reducing the recognition error rate by 34% on average across the three languages.
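
To make the reported error-rate reduction concrete, here is a minimal sketch of how a recognition error rate and its reduction could be computed. The abstract does not say which metric is used or whether the 34% figure is relative or absolute; this sketch assumes a character error rate based on Levenshtein edit distance and a relative reduction. The function names and the example strings are hypothetical and are not taken from the paper.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, 1):
        curr = [i]  # deleting the first i reference characters
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def char_error_rate(ref: str, hyp: str) -> float:
    """Edit distance normalized by reference length (assumed metric)."""
    if not ref:
        raise ValueError("reference transcription must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


def relative_reduction(before: float, after: float) -> float:
    """Fraction of the original error removed (assumed reading of 'reducing by')."""
    return (before - after) / before


# Hypothetical example: a gold transcription, raw OCR output, and a corrected output.
gold = "endangered"
first_pass = "endangcrcd"   # two substituted characters
corrected = "endangcred"    # one substituted character remains

cer_before = char_error_rate(gold, first_pass)
cer_after = char_error_rate(gold, corrected)
print(cer_before, cer_after, relative_reduction(cer_before, cer_after))
```

In this toy case the post-corrected output halves the character error rate. Averaging such reductions over languages, as the abstract describes, would be a further step that is not shown here.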
