Rethinking Optimization and Architecture for Tiny Language Models

Abstract

The power of large language models (LLMs) has been demonstrated with vast amounts of data and computing resources. However, deploying language models on mobile devices faces huge challenges in computation and memory costs; that is, tiny language models with high performance are urgently required. Because the training process is highly complex, many details of optimizing language models are seldom studied carefully. In this study, based on a tiny language model with 1B parameters, we carefully design a series of empirical studies to analyze the effect of each component. Three perspectives are mainly discussed, i.e., neural architecture, parameter initialization, and optimization strategy. Several design formulas are empirically shown to be especially effective for tiny language models, including tokenizer compression, architecture tweaking, parameter inheritance, and multiple-round training. We then train PanGu-$\pi$-1B Pro and PanGu-$\pi$-1.5B Pro on 1.6T multilingual corpora, following the established formulas. Experimental results demonstrate that the improved optimization and architecture yield a notable average improvement of 8.87 on benchmark evaluation sets for PanGu-$\pi$-1B Pro. Moreover, PanGu-$\pi$-1.5B Pro surpasses a range of SOTA models with larger model sizes, validating its superior performance. The code will be released soon (https://github.com/YuchuanTian/RethinkTinyLM).
