Abstract
Large Language Models (LLMs) have achieved remarkable success, yet recent findings reveal that their deeper layers often contribute minimally and can be pruned without affecting overall performance. While some view this as an opportunity for model compression, we identify it as a training shortfall rooted in the widespread use of Pre-Layer Normalization (Pre-LN). We demonstrate that Pre-LN, commonly employed in models like GPT and LLaMA, leads to diminished gradient norms in its deeper layers, reducing their effectiveness. In contrast, Post-Layer Normalization (Post-LN) preserves larger gradient norms in deeper layers but suffers from vanishing gradients in earlier layers. To address this, we introduce Mix-LN, a novel normalization technique that combines the strengths of Pre-LN and Post-LN within the same model. Mix-LN applies Post-LN to the earlier layers and Pre-LN to the deeper layers, ensuring more uniform gradients across layers. This allows all parts of the network, both shallow and deep layers, to contribute effectively to training. Extensive experiments with various model sizes from 70M to 7B demonstrate that Mix-LN consistently outperforms both Pre-LN and Post-LN, promoting more balanced, healthier gradient norms throughout the network, and enhancing the overall quality of LLM pre-training. Furthermore, we demonstrate that models pre-trained with Mix-LN learn better compared to those using Pre-LN or Post-LN during supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF), highlighting the critical importance of high-quality deep layers. By effectively addressing the inefficiencies of deep layers in current LLMs, Mix-LN unlocks their potential, enhancing model capacity without increasing model size. Our code is available at https://github.com/pixeli99/MixLN.
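The layer assignment described above (Post-LN for the earlier layers, Pre-LN for the deeper ones) can be illustrated with a minimal, dependency-free sketch. This is not the authors' implementation, which lives in the linked repository. The abstract does not state how many layers receive Post-LN, so the `post_ln_fraction` parameter and its default value are hypothetical, as are all function names. The layer normalization here has no learned scale or shift, and the sublayers are arbitrary callables standing in for attention or feed-forward modules.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned parameters)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def pre_ln_block(x, sublayer):
    """Pre-LN: normalize the sublayer input; the residual path stays unnormalized."""
    y = sublayer(layer_norm(x))
    return [a + b for a, b in zip(x, y)]


def post_ln_block(x, sublayer):
    """Post-LN: add the residual first, then normalize the sum."""
    y = sublayer(x)
    return layer_norm([a + b for a, b in zip(x, y)])


def mix_ln_schedule(num_layers, post_ln_fraction):
    """Assign 'post' to the earliest layers and 'pre' to the remaining deeper ones.

    post_ln_fraction is a hypothetical knob; the abstract gives no value for it.
    """
    num_post = int(num_layers * post_ln_fraction)
    return ["post"] * num_post + ["pre"] * (num_layers - num_post)


def mix_ln_forward(x, sublayers, post_ln_fraction=0.25):
    """Run a stack of residual blocks using the mixed normalization schedule."""
    schedule = mix_ln_schedule(len(sublayers), post_ln_fraction)
    for kind, sublayer in zip(schedule, sublayers):
        block = post_ln_block if kind == "post" else pre_ln_block
        x = block(x, sublayer)
    return x
```

For example, `mix_ln_schedule(8, 0.25)` yields two `'post'` entries followed by six `'pre'` entries. The only structural difference between the two block types is where the normalization sits relative to the residual addition, which is what changes how gradients propagate through the stack.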