Abstract
We present a novel approach to data preparation for developing multilingual Indic large language models. Our data acquisition spans open-source and proprietary sources, including Common Crawl, Indic books, news articles, and Wikipedia, ensuring a diverse and rich linguistic representation. For each Indic language, we design a custom preprocessing pipeline to eliminate redundant and low-quality text content. Additionally, we perform deduplication on Common Crawl data to address the redundancy present in 70% of the crawled web pages. This study focuses on developing high-quality data and optimizing tokenization for our multilingual dataset, targeting Indic large language models with 3B and 7B parameters engineered for superior performance in Indic languages. We introduce a novel multilingual tokenizer training strategy and demonstrate that our custom-trained Indic tokenizer outperforms the state-of-the-art OpenAI Tiktoken tokenizer, achieving a superior token-to-word ratio for Indic languages.