Assessing the Role of Data Quality in Training Bilingual Language Models

Abstract

Bilingual and multilingual language models offer a promising path toward scaling NLP systems across diverse languages and users. However, their performance often varies wildly between languages, as prior work shows that adding more languages can degrade performance for some languages (such as English) while improving others (typically more data-constrained languages). In this work, we investigate the causes of these inconsistencies by comparing bilingual and monolingual language models. Our analysis reveals that unequal data quality, not just data quantity, is a major driver of performance degradation in bilingual settings. We propose a simple yet effective data filtering strategy to select higher-quality bilingual training data with only high-quality English data. Applied to French, German, and Chinese, our approach improves monolingual performance by 2-4% and reduces bilingual model performance gaps to 1%. These results highlight the overlooked importance of data quality in multilingual pretraining and offer a practical recipe for balancing performance.
