ALLaM: Large Language Models for Arabic and English

Abstract

We present ALLaM: Arabic Large Language Model, a series of large language models to support the ecosystem of Arabic Language Technologies (ALT). ALLaM is carefully trained considering the values of language alignment and knowledge transfer at scale. Our autoregressive decoder-only architecture models demonstrate how second-language acquisition via vocabulary expansion and pretraining on a mixture of Arabic and English text can steer a model towards a new language (Arabic) without any catastrophic forgetting in the original language (English). Furthermore, we highlight the effectiveness of using parallel/translated data to aid the process of knowledge alignment between languages. Finally, we show that extensive alignment with human preferences can significantly enhance the performance of a language model compared to models of a larger scale with lower quality alignment. ALLaM achieves state-of-the-art performance in various Arabic benchmarks, including MMLU Arabic, ACVA, and Arabic Exams. Our aligned models improve both in Arabic and English from their base aligned models.
