Chinese Tiny LLM: Pretraining a Chinese-Centric Large Language Model

Abstract

In this study, we introduce CT-LLM, a 2B large language model (LLM) that illustrates a pivotal shift towards prioritizing the Chinese language in developing LLMs. Uniquely initiated from scratch, CT-LLM diverges from the conventional methodology by primarily incorporating Chinese textual data, utilizing an extensive corpus of 1,200 billion tokens, including 800 billion Chinese tokens, 300 billion English tokens, and 100 billion code tokens. This strategic composition facilitates the model's exceptional proficiency in understanding and processing Chinese, a capability further enhanced through alignment techniques. Demonstrating remarkable performance on the CHC-Bench, CT-LLM excels in Chinese language tasks, and showcases its adeptness in English through SFT. This research challenges the prevailing paradigm of training LLMs predominantly on English corpora and then adapting them to other languages, broadening the horizons for LLM training methodologies. By open-sourcing the full process of training a Chinese LLM, including a detailed data processing procedure with the obtained Massive Appropriate Pretraining Chinese Corpus (MAP-CC), a well-chosen multidisciplinary Chinese Hard Case Benchmark (CHC-Bench), and the 2B-size Chinese Tiny LLM (CT-LLM), we aim to foster further exploration and innovation in both academia and industry, paving the way for more inclusive and versatile language models.
