Abstract
Optical Character Recognition (OCR) plays a crucial role in digitizing historical and multilingual documents, yet OCR errors (imperfect extraction of text, including character insertion, deletion, and substitution) can significantly impact downstream tasks such as question answering (QA). In this work, we conduct a comprehensive analysis of how OCR-induced noise affects the performance of multilingual QA systems. To support this analysis, we introduce MultiOCR-QA, a multilingual QA dataset comprising 50K question-answer pairs across three languages: English, French, and German. The dataset is curated from OCR-ed historical documents, which include different levels and types of OCR noise. We then evaluate how state-of-the-art Large Language Models (LLMs) perform under different error conditions, focusing on three major OCR error types. Our findings show that QA systems are highly prone to OCR-induced errors and perform poorly on noisy OCR text. By comparing model performance on clean versus noisy texts, we provide insights into the limitations of current approaches and emphasize the need for more noise-resilient QA systems in historical digitization contexts.