Noisy Parallel Data Alignment

Abstract

An ongoing challenge in current natural language processing is that its major advancements tend to disproportionately favor resource-rich languages, leaving a significant number of under-resourced languages behind. Due to the lack of resources required to train and evaluate models, most modern language technologies are either nonexistent or unreliable for processing endangered, local, and non-standardized languages. Optical character recognition (OCR) is often used to convert endangered language documents into machine-readable data. However, such OCR output is typically noisy, and most word alignment models are not built to work under such noisy conditions. In this work, we study existing word-level alignment models under noisy settings and aim to make them more robust to noisy data. Our noise simulation and structural biasing method, tested on multiple language pairs, reduces the alignment error rate of a state-of-the-art neural alignment model by up to 59.6%.
