CLOCR-C: Context Leveraging OCR Correction with Pre-trained Language Models

Abstract

The digitisation of historical print media archives is crucial for increasing accessibility to contemporary records. However, the process of Optical Character Recognition (OCR) used to convert physical records to digital text is prone to errors, particularly in the case of newspapers and periodicals due to their complex layouts. This paper introduces Context Leveraging OCR Correction (CLOCR-C), which utilises the infilling and context-adaptive abilities of transformer-based language models (LMs) to improve OCR quality. The study aims to determine whether LMs can perform post-OCR correction, whether the corrections improve downstream NLP tasks, and what value there is in providing the socio-cultural context as part of the correction process. Experiments were conducted using seven LMs on three datasets: the 19th Century Serials Edition (NCSE) and two datasets from the Overproof collection. The results demonstrate that some LMs can significantly reduce error rates, with the top-performing model achieving over a 60% reduction in character error rate on the NCSE dataset. The OCR improvements extend to downstream tasks, such as Named Entity Recognition, with increased Cosine Named Entity Similarity. Furthermore, the study shows that providing socio-cultural context in the prompts improves performance, while misleading prompts lower performance. In addition to the findings, this study releases a dataset of 91 transcribed articles from the NCSE, containing a total of 40 thousand words, to support further research in this area. The findings suggest that CLOCR-C is a promising approach for enhancing the quality of existing digital archives by leveraging the socio-cultural information embedded in the LMs and the text requiring correction.
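
To make the headline metric concrete, the following is a minimal sketch of how a character error rate (CER) and its relative reduction are commonly computed. It assumes the standard definition of CER as the Levenshtein distance between the ground-truth transcription and the OCR text, divided by the length of the ground truth; the paper's exact evaluation code and any text normalisation it applies are not shown here. The function names and the sample strings are hypothetical and are not drawn from the NCSE or Overproof data.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate of hypothesis against a ground-truth reference."""
    if not reference:
        raise ValueError("reference must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


def error_reduction(reference: str, raw_ocr: str, corrected: str) -> float:
    """Percentage reduction in CER after correction (negative if it got worse)."""
    raw_cer = cer(reference, raw_ocr)
    if raw_cer == 0:
        return 0.0  # nothing to correct; assumption: report no change
    return 100.0 * (raw_cer - cer(reference, corrected)) / raw_cer


if __name__ == "__main__":
    # Invented example strings, for illustration only.
    truth = "the quick brown fox"
    raw = "tbe qu1ck brovvn f0x"
    fixed = "the quick brown f0x"
    print(f"raw CER:       {cer(truth, raw):.3f}")
    print(f"corrected CER: {cer(truth, fixed):.3f}")
    print(f"reduction:     {error_reduction(truth, raw, fixed):.1f}%")
```

Under this definition, a reduction above 60% means the corrected text retains less than 40% of the character errors present in the raw OCR output, measured against the same ground truth.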
