TinyHelen's First Curriculum: Training and Evaluating Tiny Language Models in a Simpler Language Environment

Abstract

Training language models (LMs) and their application agents is increasingly costly due to large datasets and models, making test failures difficult to bear. Simplified language environments serve as primordial training and testing grounds, retaining essential commonsense and communication skills but in a more digestible form, potentially enhancing the learning efficiency of LMs, and thus reducing the required model size and data volume for effective training and evaluation. In these simplified language environments, workable strategies for small models, datasets, and agents may be adaptable to larger models, datasets, and agents in complex language environments. To create such environments, we focus on two aspects: i) minimizing language dataset noise and complexity, and ii) preserving the essential text distribution characteristics. Unlike previous methods, we propose a pipeline to refine text data by eliminating noise, minimizing vocabulary, and maintaining genre-specific patterns (e.g., for books, conversation, code, etc.). Implementing this pipeline with large LMs, we have created a leaner suite of LM training and evaluation datasets: 71M Leaner-Pretrain, 7M Leaner-Instruct, Leaner-Glue for assessing linguistic proficiency, and Leaner-Eval for testing instruction-following ability. Our experiments show that leaner pre-training boosts LM learning efficiency. Tiny LMs trained on these datasets outperform those trained on original datasets in instruction-following across different language granularity levels. Moreover, the Leaner-Pretrain dataset's alignment with conventional large LM training sets enables resource-optimized analysis of how learning objectives, model architectures, and training techniques impact performance on language modeling and downstream tasks. Our code and datasets are available at https://github.com/EmpathYang/TinyHelen.git.
