Abstract
We present ALLaM: Arabic Large Language Model, a series of large language models to support the ecosystem of Arabic Language Technologies (ALT). ALLaM is carefully trained considering the values of language alignment and knowledge transfer at scale. Our autoregressive decoder-only architecture models demonstrate how second-language acquisition via vocabulary expansion and pretraining on a mixture of Arabic and English text can steer a model towards a new language (Arabic) without any catastrophic forgetting in the original language (English). Furthermore, we highlight the effectiveness of using parallel/translated data to aid the process of knowledge alignment between languages. Finally, we show that extensive alignment with human preferences can significantly enhance the performance of a language model compared to models of a larger scale with lower quality alignment. ALLaM achieves state-of-the-art performance in various Arabic benchmarks, including MMLU Arabic, ACVA, and Arabic Exams. Our aligned models improve in both Arabic and English over their base aligned models.