Regional Tiny Stories: Using Small Models to Compare Language Learning and Tokenizer Performance

Abstract

Small Language Models (SLMs) offer efficient alternatives to LLMs for specific domains. The 2023 TinyStories study developed an English dataset that allows SLMs with 1 to 10 million parameters to produce coherent outputs. Our research expands this framework by translating the original dataset into Indian languages and creating synthetic data using LLMs. We focus on Hindi, Marathi, and Bengali, evaluating SLMs for regional language processing and understanding linguistic complexity. We show that SLMs efficiently process regional languages with significantly fewer parameters than LLMs, providing a complementary framework for "inference-based evaluation" of tokenization strategies and linguistic complexity. Our analysis shows that language-specific tokenizers outperform general-purpose ones for Indian languages. Empirical validations, supported by information-theoretic and morphological analyses, provide a fundamental understanding of the better performance of Hindi models over Marathi and Bengali. Additionally, we show that synthetic datasets outperform translated content for training SLMs. Correlation analyses reveal cross-linguistic patterns and language-specific relationships between creativity, grammatical precision, and narrative completeness. These findings advance both the practical application of SLMs to underserved languages and our theoretical understanding of neural language development.
