ReplaceMe: Network Simplification via Layer Pruning and Linear Transformations

Abstract

We introduce ReplaceMe, a generalized training-free depth pruning method that effectively replaces transformer blocks with a linear operation, while maintaining high performance for low compression ratios. In contrast to conventional pruning approaches that require additional training or fine-tuning, our approach requires only a small calibration dataset that is used to estimate a linear transformation to approximate the pruned blocks. This estimated linear mapping can be seamlessly merged with the remaining transformer blocks, eliminating the need for any additional network parameters. Our experiments show that ReplaceMe consistently outperforms other training-free approaches and remains highly competitive with state-of-the-art pruning methods that involve extensive retraining/fine-tuning and architectural modifications. Applied to several large language models (LLMs), ReplaceMe achieves up to 25% pruning while retaining approximately 90% of the original model's performance on open benchmarks - without any training or healing steps, resulting in minimal computational overhead (see Fig. 1). We provide an open-source library implementing ReplaceMe alongside several state-of-the-art depth pruning techniques, available at this repository.
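
The two steps named in the abstract, estimating a linear map from calibration data and folding it into existing weights, can be illustrated with a small NumPy sketch. This is not the paper's implementation: the least-squares objective, the synthetic activations, and the names `estimate_linear_map`, `merge_into_projection`, and `w_down` are assumptions made for illustration only, and the abstract does not say which weights the mapping is merged into.

```python
import numpy as np


def estimate_linear_map(x_in, x_out):
    """Return T minimizing ||x_in @ T - x_out||_F (ordinary least squares).

    Assumption: a plain least-squares fit stands in for whatever
    estimator the paper actually uses.
    """
    t, *_ = np.linalg.lstsq(x_in, x_out, rcond=None)
    return t


def merge_into_projection(w_down, t):
    """Fold T into a preceding linear layer y = h @ w_down.

    Since (h @ w_down) @ T == h @ (w_down @ T), the merged weight has the
    same shape as w_down, so no extra parameters are introduced.
    """
    return w_down @ t


rng = np.random.default_rng(0)
n_tokens, d_model, d_hidden = 512, 16, 32

# Synthetic stand-ins for calibration activations (assumption): x_in plays the
# role of hidden states entering the pruned blocks, x_out those leaving them.
x_in = rng.standard_normal((n_tokens, d_model))
true_map = np.eye(d_model) + 0.1 * rng.standard_normal((d_model, d_model))
x_out = x_in @ true_map + 0.01 * rng.standard_normal((n_tokens, d_model))

t_hat = estimate_linear_map(x_in, x_out)

# Hypothetical output projection of the last block that is kept.
w_down = rng.standard_normal((d_hidden, d_model))
w_merged = merge_into_projection(w_down, t_hat)

# Relative approximation error of the fitted map on the calibration data.
rel_err = np.linalg.norm(x_in @ t_hat - x_out) / np.linalg.norm(x_out)
```

Because the merged matrix has the same shape as the original projection, the toy model keeps its parameter count for the retained layers, which mirrors the abstract's claim that no additional network parameters are needed.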
