Komodo: A Linguistic Expedition into Indonesia's Regional Languages

  • 2024-03-19 06:49:01
  • Louis Owen, Vishesh Tripathi, Abhay Kumar, Biddwan Ahmed

Abstract

Recent breakthroughs in Large Language Models (LLMs) have mostly focused on languages with readily available and sufficient resources, such as English. However, a significant gap remains for languages that lack sufficient linguistic resources in the public domain. Our work introduces Komodo-7B, a family of 7-billion-parameter Large Language Models designed to address this gap by operating seamlessly across Indonesian, English, and 11 regional languages in Indonesia. The family consists of Komodo-7B-Base and Komodo-7B-Instruct. Komodo-7B-Instruct stands out by achieving state-of-the-art performance in various tasks and languages, outperforming the benchmarks set by OpenAI's GPT-3.5, Cohere's Aya-101, Llama-2-Chat-13B, Mixtral-8x7B-Instruct-v0.1, Gemma-7B-it, and many more. The model not only demonstrates superior performance in both language-specific and overall assessments but also shows its ability to handle linguistic diversity. Our commitment to advancing language models extends beyond well-resourced languages, aiming to bridge the gap for those with limited linguistic assets. Additionally, Komodo-7B-Instruct's stronger cross-language understanding contributes to addressing educational disparities in Indonesia by offering direct translation from English to 11 regional languages, a significant improvement over existing language translation services. Komodo-7B represents a crucial step towards inclusivity and effectiveness in language models, catering to the linguistic needs of diverse communities.

 
