From English-Centric to Effective Bilingual: LLMs with Custom Tokenizers for Underrepresented Languages

  • 2024-10-24 16:20:54
  • Artur Kiulian, Anton Polishko, Mykola Khandoga, Yevhen Kostiuk, Guillermo Gabrielli, Łukasz Gagała, Fadi Zaraket, Qusai Abu Obaida, Hrishikesh Garud, Wendy Wing Yee Mak, Dmytro Chaplynskyi, Selma Belhadj Amor, Grigol Peradze

Abstract

In this paper, we propose a model-agnostic, cost-effective approach to developing bilingual base large language models (LLMs) to support English and any target language. The method includes vocabulary expansion, initialization of new embeddings, model training, and evaluation. We performed our experiments with three languages, each using a non-Latin script: Ukrainian, Arabic, and Georgian. Our approach demonstrates improved language performance while reducing computational costs. It mitigates the disproportionate penalization of underrepresented languages, promoting fairness and minimizing adverse phenomena such as code-switching and broken grammar. Additionally, we introduce new metrics to evaluate language quality, revealing that vocabulary size significantly impacts the quality of generated text.

 
