Language models (LMs) are powerful tools for natural language processing, but they often struggle to produce coherent and fluent text when they are small. Models with around 125M parameters, such as GPT-Neo (small) or GPT-2 (small), can rarely generate coherent and consistent English text beyond a few words even after extensive training. This raises the question of whether the ability to produce coherent English text emerges only at larger scales (with hundreds of millions of parameters or more) and with complex architectures (with many layers of global attention). In this work, we introduce TinyStories, a synthetic dataset of short stories that contain only words that typical 3- to 4-year-olds usually understand, generated by GPT-3.5 and GPT-4. We show that TinyStories can be used to train and evaluate LMs that are much smaller than state-of-the-art models (below 10 million total parameters), or have much simpler architectures (with only one transformer block), yet still produce fluent and consistent stories of several paragraphs that are diverse, have almost perfect grammar, and demonstrate reasoning capabilities. We also introduce a new paradigm for the evaluation of language models: we suggest a framework that uses GPT-4 to grade the content generated by these models as if those were stories written by students and graded by a (human) teacher. This new paradigm overcomes the flaws of standard benchmarks, which often require the model's output to be very structured, and moreover provides a multidimensional score for the model, with separate scores for different capabilities such as grammar, creativity and consistency. We hope that TinyStories can facilitate the development, analysis and research of LMs, especially for low-resource or specialized domains, and shed light on the emergence of language capabilities in LMs.