BinaryBERT: Pushing the Limit of BERT Quantization

Abstract

The rapid development of large pre-trained language models has greatly increased the demand for model compression techniques, among which quantization is a popular solution. In this paper, we propose BinaryBERT, which pushes BERT quantization to the limit with weight binarization. We find that a binary BERT is harder to train directly than a ternary counterpart due to its complex and irregular loss landscape. Therefore, we propose ternary weight splitting, which initializes the binary model by equivalent splitting from a half-sized ternary network. The binary model thus inherits the good performance of the ternary model, and can be further enhanced by fine-tuning the new architecture after splitting. Empirical results show that BinaryBERT has a negligible performance drop compared to the full-precision BERT-base while being $24\times$ smaller, achieving state-of-the-art results on the GLUE and SQuAD benchmarks.
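
To make the idea of "equivalent splitting" concrete, the following is a minimal sketch and not the paper's actual splitting operator. It assumes a single shared scale `alpha` and uses the elementary identity that every ternary value in {-alpha, 0, +alpha} is the sum of two binary values in {-alpha/2, +alpha/2}. The names `split_ternary` and `dot` are hypothetical helpers introduced only for this illustration.

```python
def split_ternary(weights, alpha):
    """Split ternary weights in {-alpha, 0, +alpha} into two binary
    weight lists in {-alpha/2, +alpha/2} whose elementwise sum equals
    the original. Simplified assumption: one shared scale alpha."""
    half = alpha / 2
    first, second = [], []
    for w in weights:
        if w > 0:        # +alpha = (+alpha/2) + (+alpha/2)
            first.append(half)
            second.append(half)
        elif w < 0:      # -alpha = (-alpha/2) + (-alpha/2)
            first.append(-half)
            second.append(-half)
        else:            # 0 = (+alpha/2) + (-alpha/2)
            first.append(half)
            second.append(-half)
    return first, second


def dot(a, b):
    """Plain dot product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


alpha = 0.5
ternary = [alpha, 0.0, -alpha, alpha]   # a toy ternary weight row
x = [1.0, 2.0, 3.0, 4.0]                # a toy input vector

b1, b2 = split_ternary(ternary, alpha)

# The two binary rows together reproduce the ternary row's output,
# which is the sense in which the split is "equivalent".
out_ternary = dot(ternary, x)
out_binary = dot(b1, x) + dot(b2, x)
```

In this toy setting the two binary rows double the number of weights, which mirrors the abstract's statement that the binary model is initialized from a half-sized ternary network; how the paper handles scales and latent full-precision weights is not covered by this sketch.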
