Abstract
There is little to no data available to build natural language processing models for most endangered languages. However, textual data in these languages often exists in formats that are not machine-readable, such as paper books and scanned images. In this work, we address the task of extracting text from these resources. We create a benchmark dataset of transcriptions for scanned books in three critically endangered languages and present a systematic analysis of how general-purpose OCR tools are not robust to the data-scarce setting of endangered languages. We develop an OCR post-correction method tailored to ease training in this data-scarce setting, reducing the recognition error rate by 34% on average across the three languages.