Optimizing Inference Performance of Transformers on CPUs

Abstract

The Transformer architecture revolutionized the field of natural language processing (NLP). Transformer-based models (e.g., BERT) power many important Web services, such as search, translation, and question answering. While enormous research attention is paid to the training of those models, relatively little effort has been made to improve their inference performance. This paper addresses this gap by presenting an empirical analysis of the scalability and performance of inference with a Transformer-based model on CPUs. Focusing on the highly popular BERT model, we identify key components of the Transformer architecture where the bulk of the computation happens, and propose three optimizations to speed them up. The optimizations are evaluated using the inference benchmark from HuggingFace, and are shown to achieve a speedup of up to 2.36x. The considered optimizations neither require changes to the implementation of the models nor affect their accuracy.
