Abstract
Transformer-based language models such as BERT provide significant accuracy improvement to a multitude of natural language processing (NLP) tasks. However, their hefty computational and memory demands make them challenging to deploy to resource-constrained edge platforms with strict latency requirements. We present EdgeBERT, an in-depth and principled algorithm and hardware design methodology to achieve minimal latency and energy consumption on multi-task NLP inference. Compared to the ALBERT baseline, we achieve up to 2.4x and 13.4x inference latency and memory savings, respectively, with less than 1%-pt drop in accuracy on several GLUE benchmarks by employing a calibrated combination of 1) entropy-based early stopping, 2) adaptive attention span, 3) movement and magnitude pruning, and 4) floating-point quantization. Furthermore, in order to maximize the benefits of these algorithms in always-on and intermediate edge computing settings, we specialize a scalable hardware architecture wherein floating-point bit encodings of the shareable multi-task embedding parameters are stored in high-density non-volatile memory. Altogether, EdgeBERT enables fully on-chip inference acceleration of NLP workloads with 5.2x and 157x lower energy than that of an un-optimized accelerator and CUDA adaptations on an Nvidia Jetson Tegra X2 mobile GPU, respectively.
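As an illustration of the first technique listed above, entropy-based early stopping can be sketched as follows. This is a minimal sketch under stated assumptions, not the EdgeBERT implementation: it assumes each Transformer layer has an attached classifier that emits logits, and that inference stops at the first layer whose prediction entropy falls below a threshold. The function names (`softmax`, `entropy`, `early_exit_predict`) and the threshold value are hypothetical and chosen only for this example.

```python
import math
from typing import Iterable, List, Optional, Sequence, Tuple


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def early_exit_predict(
    layer_logits: Iterable[Sequence[float]], threshold: float
) -> Tuple[int, int]:
    """Return (predicted_class, exit_layer_index).

    `layer_logits` yields the classifier logits of each layer in order.
    It may be a generator, so layers after the exit point are never
    evaluated. If no layer is confident enough, the last layer's
    prediction is returned.
    """
    result: Optional[Tuple[int, int]] = None
    for i, logits in enumerate(layer_logits):
        probs = softmax(logits)
        result = (probs.index(max(probs)), i)
        if entropy(probs) < threshold:  # confident enough: stop here
            break
    if result is None:
        raise ValueError("layer_logits must yield at least one layer")
    return result


# Hypothetical per-layer logits for a 2-class task (illustrative values).
layers = [[0.1, 0.0], [0.5, 0.0], [6.0, 0.0], [9.0, 0.0]]
prediction, exit_layer = early_exit_predict(layers, threshold=0.1)
```

With these illustrative logits, the first two layers are close to uniform and do not meet the threshold, so the loop continues until a layer produces a sharply peaked distribution. A lower threshold trades latency for accuracy by forcing more layers to run.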