From English-Centric to Effective Bilingual: LLMs with Custom Tokenizers for Underrepresented Languages

Abstract

In this paper, we propose a model-agnostic, cost-effective approach to developing bilingual base large language models (LLMs) to support English and any target language. The method includes vocabulary expansion, initialization of new embeddings, model training, and evaluation. We performed our experiments with three languages, each using a non-Latin script: Ukrainian, Arabic, and Georgian. Our approach demonstrates improved language performance while reducing computational costs. It mitigates the disproportionate penalization of underrepresented languages, promoting fairness and minimizing adverse phenomena such as code-switching and broken grammar. Additionally, we introduce new metrics to evaluate language quality, revealing that vocabulary size significantly impacts the quality of generated text.
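
To make the "initialization of new embeddings" step concrete, the following is a minimal sketch rather than the authors' implementation. The abstract does not state which initialization scheme is used; the sketch assumes one common choice, in which each newly added token starts from the mean of the embeddings of the pieces the original tokenizer would split it into. The names `mean_init_embeddings`, `toy_tokenize`, and the toy vocabulary are hypothetical and exist only for illustration.

```python
from typing import Callable, Dict, List

Vector = List[float]


def mean_init_embeddings(
    old_embeddings: Dict[str, Vector],
    new_tokens: List[str],
    old_tokenize: Callable[[str], List[str]],
) -> Dict[str, Vector]:
    """Give each new token the mean embedding of its old-tokenizer pieces.

    Assumption (not from the paper): mean-of-subtokens initialization.
    """
    dim = len(next(iter(old_embeddings.values())))
    result: Dict[str, Vector] = {}
    for token in new_tokens:
        # Split the new token with the ORIGINAL tokenizer and keep known pieces.
        pieces = [p for p in old_tokenize(token) if p in old_embeddings]
        if not pieces:
            # Fallback assumption: average over the whole old vocabulary.
            pieces = list(old_embeddings)
        result[token] = [
            sum(old_embeddings[p][i] for p in pieces) / len(pieces)
            for i in range(dim)
        ]
    return result


# Toy data: a three-entry "old" vocabulary with 2-dimensional embeddings.
old = {"пр": [1.0, 0.0], "ив": [0.0, 1.0], "іт": [1.0, 1.0]}


def toy_tokenize(text: str) -> List[str]:
    """Hypothetical stand-in for the original tokenizer: 2-character chunks."""
    return [text[i:i + 2] for i in range(0, len(text), 2)]


new = mean_init_embeddings(old, ["привіт"], toy_tokenize)
print(new["привіт"])
```

In a real model, the resulting vectors would be written into the rows appended to the embedding matrix after the vocabulary is expanded; starting from averaged existing embeddings rather than random values is generally intended to keep the extended model close to the original at the start of training.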
