Abstract
Recently, pre-trained Transformer-based language models such as BERT and GPT have shown great improvement in many Natural Language Processing (NLP) tasks. However, these models contain a large number of parameters. The emergence of even larger and more accurate models such as GPT2 and Megatron suggests a trend toward large pre-trained Transformer models. However, using these large models in production environments is a complex task requiring a large amount of compute, memory, and power resources. In this work we show how to perform quantization-aware training during the fine-tuning phase of BERT in order to compress BERT by $4\times$ with minimal accuracy loss. Furthermore, the produced quantized model can accelerate inference if it is optimized for hardware that supports 8-bit integer operations.
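
To make the $4\times$ figure concrete, the following is a minimal sketch of symmetric linear 8-bit quantization in NumPy. It is an illustration under stated assumptions, not the implementation used in this work: the function names `quantize_int8` and `fake_quantize`, the max-absolute-value scaling rule, and the 768-by-768 random weight matrix are all hypothetical choices made for the example. Storing a 32-bit floating-point tensor as 8-bit integers reduces its size by a factor of 32 / 8 = 4, and quantization-aware training typically simulates this rounding during the forward pass so that the model learns to tolerate it.

```python
import numpy as np


def quantize_int8(x):
    """Symmetric linear quantization of a float32 array to int8.

    Assumption for this sketch: the scale maps the largest absolute
    value in x to 127, the largest symmetric int8 magnitude.
    """
    max_abs = float(np.max(np.abs(x)))
    scale = 127.0 / max_abs if max_abs > 0 else 1.0
    q = np.clip(np.round(x * scale), -127, 127).astype(np.int8)
    return q, scale


def fake_quantize(x):
    """Quantize and immediately dequantize, staying in float32.

    This simulates the rounding error of 8-bit storage while keeping
    the values in floating point, as is commonly done during
    quantization-aware training.
    """
    q, scale = quantize_int8(x)
    return q.astype(np.float32) / scale


# Hypothetical weight matrix standing in for one layer's parameters.
rng = np.random.default_rng(0)
weights = rng.standard_normal((768, 768)).astype(np.float32)

q_weights, scale = quantize_int8(weights)

# 4 bytes per float32 value versus 1 byte per int8 value.
compression_ratio = weights.nbytes / q_weights.nbytes

# Worst-case rounding error introduced by quantization.
max_error = float(np.max(np.abs(weights - fake_quantize(weights))))

print(f"compression ratio: {compression_ratio}")
print(f"max absolute error: {max_error:.6f}")
```

In a training framework, the backward pass would usually treat the rounding step as the identity (a straight-through estimator) so that gradients can still reach the underlying float weights; that part is omitted here because it depends on the framework's autograd mechanism.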