Abstract
We present "Paramanu", a family of novel language models (LMs) for Indian languages, consisting of auto-regressive monolingual, bilingual, and multilingual models pretrained from scratch. It currently covers 10 languages (Assamese, Bangla, Hindi, Konkani, Maithili, Marathi, Odia, Sanskrit, Tamil, Telugu) across 5 scripts (Bangla, Devanagari, Odia, Tamil, Telugu). The models are pretrained on a single GPU with a context size of 1024 and vary in size from 13.29 million (M) to 367.5 M parameters. We proposed a RoPE embedding scaling method that enables us to pretrain language models from scratch at a larger sequence-length context size than typical GPU memory permits. We also introduced a novel, efficient Indic tokenizer, "mBharat", which combines BPE and Unigram, achieves the lowest fertility score, and can tokenize unseen languages in both the same script and Roman script. We also proposed and performed language-specific tokenization for multilingual models and domain-specific tokenization for monolingual models. To address the "curse of multilinguality" in our mParamanu model, we pretrained on comparable corpora based on typological grouping within the same script. Our findings show a language transfer phenomenon from low-resource to high-resource languages among languages of the same script and typology. Human evaluations of open-ended text generation demonstrated that Paramanu models outperformed several LLMs despite being 20 to 64 times smaller. We created instruction-tuning datasets and instruction-tuned our models on 23,000 instructions in the respective languages. Comparisons with multilingual LLMs across various benchmarks for natural language (NL) understanding, NL inference, and reading comprehension highlight the advantages of our models and lead to the conclusion that high-quality generative LMs are possible without large amounts of compute power or an enormous number of parameters.
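For readers unfamiliar with the fertility score mentioned above, it is commonly defined as the average number of tokens a tokenizer produces per word, so lower values mean less fragmentation. The following is a minimal sketch of that definition, not the evaluation code used for mBharat. The function names and the two stand-in tokenizers are hypothetical, and the sketch assumes that words are whitespace-delimited and that each word is tokenized independently.

```python
from typing import Callable, Iterable, List


def fertility(texts: Iterable[str], tokenize: Callable[[str], List[str]]) -> float:
    """Average number of tokens produced per whitespace-delimited word."""
    total_words = 0
    total_tokens = 0
    for text in texts:
        words = text.split()  # assumption: whitespace marks word boundaries
        total_words += len(words)
        # Simplification: tokenize each word on its own and sum the counts.
        total_tokens += sum(len(tokenize(word)) for word in words)
    if total_words == 0:
        raise ValueError("no words to score")
    return total_tokens / total_words


def char_tokenize(word: str) -> List[str]:
    # Hypothetical stand-in tokenizer: one token per character (high fertility).
    return list(word)


def whole_word_tokenize(word: str) -> List[str]:
    # Hypothetical stand-in tokenizer: one token per word (fertility of 1).
    return [word]


if __name__ == "__main__":
    sample = ["ab cd", "efg"]
    print(fertility(sample, char_tokenize))
    print(fertility(sample, whole_word_tokenize))
```

In practice, `tokenize` would wrap a trained subword tokenizer, and the score would be computed on held-out text in each language being compared.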