OCR Post Correction for Endangered Language Texts

  • 2020-11-10 21:21:08
  • Shruti Rijhwani, Antonios Anastasopoulos, Graham Neubig

Abstract

There is little to no data available to build natural language processing models for most endangered languages. However, textual data in these languages often exists in formats that are not machine-readable, such as paper books and scanned images. In this work, we address the task of extracting text from these resources. We create a benchmark dataset of transcriptions for scanned books in three critically endangered languages and present a systematic analysis of how general-purpose OCR tools are not robust to the data-scarce setting of endangered languages. We develop an OCR post-correction method tailored to ease training in this data-scarce setting, reducing the recognition error rate by 34% on average across the three languages.
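
To make the reported figure concrete, the sketch below shows one common way a recognition error rate and its relative reduction can be computed. It is an illustration only: the abstract does not say which metric is used, so character error rate (edit distance divided by reference length) is an assumption here, and the function names and toy strings are hypothetical rather than taken from the paper.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # turning i reference characters into an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def char_error_rate(ref: str, hyp: str) -> float:
    """Character error rate (assumed metric): edit distance / reference length."""
    if not ref:
        raise ValueError("reference transcription must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


def relative_reduction(before: float, after: float) -> float:
    """Fraction of the original error rate removed by post-correction."""
    return (before - after) / before


# Toy strings, invented for illustration only.
reference = "endangered"
ocr_output = "endamgerod"   # two character errors from a first-pass OCR system
corrected = "endangerod"    # one error remains after post-correction

cer_before = char_error_rate(reference, ocr_output)
cer_after = char_error_rate(reference, corrected)
print(cer_before, cer_after, relative_reduction(cer_before, cer_after))
```

Under this reading, a 34% reduction means the error rate after post-correction is about two thirds of the first-pass OCR error rate, averaged over the three languages.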

 
