Abstract
Recently, Large Language Models (LLMs) have dominated much of the artificial intelligence scene with their ability to process and generate natural languages. However, the majority of LLM research and development remains English-centric, leaving low-resource languages such as those in the Southeast Asian (SEA) region under-represented. To address this representation gap, we introduce Llama-SEA-LION-v3-8B-IT and Gemma-SEA-LION-v3-9B-IT, two cutting-edge multilingual LLMs designed for SEA languages. The SEA-LION family of LLMs supports 11 SEA languages, namely English, Chinese, Indonesian, Vietnamese, Malay, Thai, Burmese, Lao, Filipino, Tamil, and Khmer. Our work leverages large-scale multilingual continued pre-training with a comprehensive post-training regime involving multiple stages of instruction fine-tuning, alignment, and model merging. Evaluation results on multilingual benchmarks indicate that our models achieve state-of-the-art performance across LLMs supporting SEA languages. We open-source the models to benefit the wider SEA community.