Towards Data-Efficient Language Models: A Child-Inspired Approach to Language Learning

Abstract

In this work, we explain our approach to the BabyLM Challenge, which uses various methods of training language models (LMs) on significantly less data than traditional large language models (LLMs) and is inspired by how human children learn. Although a human child is exposed to far less linguistic input than an LLM, children still achieve remarkable language understanding and generation abilities. To this end, we develop a model trained on a curated dataset of 10 million words, primarily sourced from child-directed transcripts. The initial 10M-word dataset of the 2024 BabyLM Challenge is filtered to 8.5M words and then supplemented with a randomly selected subset of the TVR dataset consisting of 1.5M words of television dialogue. The latter ensures that, like children, the model is also exposed to language through media. Furthermore, we reduce the vocabulary size to 32,000 tokens, aligning it with the limited vocabulary of children in the early stages of language acquisition. Using curriculum learning, we match the baseline on certain benchmarks and surpass it on others. Additionally, we find that incorporating common LLM training datasets, such as MADLAD-400, degrades performance. These findings underscore the importance of dataset selection, vocabulary scaling, and curriculum learning in creating more data-efficient language models that better mimic human learning processes.
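The data composition described above (a filtered core corpus plus a random subset of television dialogue capped at a fixed word budget) can be illustrated with a minimal sketch. This is not the authors' pipeline: the abstract does not say how words are counted or how the subset is drawn, so whitespace word counting, line-level sampling, and the names `count_words`, `sample_to_budget`, and `build_corpus` are all assumptions made for illustration.

```python
import random


def count_words(line: str) -> int:
    # Assumption: a "word" is a whitespace-separated token.
    return len(line.split())


def sample_to_budget(lines, budget, seed=0):
    """Visit lines in random order, keeping each one that still fits
    within `budget` words and skipping any that would exceed it."""
    rng = random.Random(seed)
    shuffled = list(lines)
    rng.shuffle(shuffled)
    picked, total = [], 0
    for line in shuffled:
        n = count_words(line)
        if total + n > budget:
            continue  # this line would overshoot the budget
        picked.append(line)
        total += n
    return picked, total


def build_corpus(filtered_lines, tv_lines, tv_budget, seed=0):
    """Hypothetical helper: append a budgeted random subset of TV
    dialogue to an already filtered core corpus."""
    tv_subset, tv_words = sample_to_budget(tv_lines, tv_budget, seed)
    core_words = sum(count_words(line) for line in filtered_lines)
    return list(filtered_lines) + tv_subset, core_words + tv_words
```

With the figures from the abstract, `filtered_lines` would hold the 8.5M-word filtered corpus and `tv_budget` would be 1,500,000, giving a combined corpus of at most 10M words under these counting assumptions.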
