Paramanu: A Family of Novel Efficient Indic Generative Foundation Language Models

Abstract

We present Gyan AI Paramanu ("atom"), a family of novel language models for Indian languages. It is a collection of auto-regressive monolingual, bilingual, and multilingual Indic language models pretrained from scratch on a single GPU for 10 Indian languages (Assamese, Bangla, Hindi, Konkani, Maithili, Marathi, Odia, Sanskrit, Tamil, Telugu) across 5 scripts (Bangla, Devanagari, Odia, Tamil, Telugu), ranging in size from 13.29M to 367.5M parameters. The models are pretrained with a context size of 1024. The models are very efficient, small, fast, and powerful. We have also developed an efficient, advanced Indic tokenizer that can tokenize even unseen languages. To avoid the "curse of multi-linguality" in our multilingual mParamanu model, we pretrained on comparable corpora, grouping languages typologically by shared script. We performed human evaluation of our pretrained models for open-ended text generation on grammar, coherence, creativity, and factuality metrics for Bangla, Hindi, and Sanskrit. Our Bangla, Hindi, and Sanskrit models outperformed the large language models (LLMs) GPT-3.5-Turbo (ChatGPT), Bloom 7B, LLaMa-2 7B, OPT 6.7B, GPT-J 6B, GPT-Neo 1.3B, and GPT2-XL by a large margin despite being 20 to 66 times smaller than standard 7B LLMs. A CPU is sufficient to run inference on our pretrained models; no GPU is needed. We also instruction-tuned our pretrained Bangla, Hindi, Marathi, Tamil, and Telugu models on 23k instructions in the respective languages. Our pretrained and instruction-tuned models are the first of their kind and the most powerful efficient small generative language models developed for Indic languages to date, and the results lead to the conclusion that high-quality generative language models are possible without large amounts of compute power or a huge number of parameters. We plan to release our models at https://www.bharatgpts.com.
