MaLLaM -- Malaysia Large Language Model

Abstract

To address the lack of Large Language Models pretrained from scratch with Malaysian context, we trained models with 1.1 billion, 3 billion, and 5 billion parameters for a single epoch on a substantial 349GB dataset, equivalent to 90 billion tokens under our pretrained Byte Pair Encoding (BPE) tokenizer. MaLLaM contributes to enhanced natural language understanding and generation tasks in the Malay language. Although trained on a comparatively small dataset of 90 billion tokens, our instruction-tuned MaLLaM models perform competitively. When compared to ChatGPT3.5 and Malaysian Mistral, MaLLaM's instruction-tuned models demonstrate notable proficiency, underscoring the effectiveness of our approach in capturing and understanding the nuances of the Malaysian language. MaLLaM models mark a significant contribution to the field, providing comprehensive language representations grounded in Malaysian context. This endeavor aims to pave the way for enhanced natural language understanding and generation tasks specific to the linguistic nuances present in Malaysia. We discuss the training methodology, dataset composition, and the potential impact of MaLLaM in advancing the capabilities of large language models within the context of the Malay language. All models are released at https://huggingface.co/collections/mesolitica/mallam-6577b59d1e0b436ae75f930f
