A Recipe of Parallel Corpora Exploitation for Multilingual Large Language Models

Abstract

Recent studies have highlighted the potential of exploiting parallel corpora to enhance multilingual large language models, improving performance in both bilingual tasks, e.g., machine translation, and general-purpose tasks, e.g., text classification. Building upon these findings, our comprehensive study aims to identify the most effective strategies for leveraging parallel corpora. We investigate the impact of parallel corpora quality and quantity, training objectives, and model size on the performance of multilingual large language models enhanced with parallel corpora across diverse languages and tasks. Our analysis reveals several key insights: (i) filtering noisy translations is essential for effectively exploiting parallel corpora, while language identification and short sentence filtering have little effect; (ii) even a corpus containing just 10K parallel sentences can yield results comparable to those obtained from much larger datasets; (iii) employing only the machine translation objective yields the best results among various training objectives and their combinations; (iv) larger multilingual language models benefit more from parallel corpora than smaller models due to their stronger capacity for cross-task transfer. Our study offers valuable insights into the optimal utilization of parallel corpora to enhance multilingual large language models, extending the generalizability of previous findings from limited languages and tasks to a broader range of scenarios.
