BERT2DNN: BERT Distillation with Massive Unlabeled Data for Online E-Commerce Search

  • 2020-10-20 16:56:04
  • Yunjiang Jiang, Yue Shang, Ziyang Liu, Hongwei Shen, Yun Xiao, Wei Xiong, Sulong Xu, Weipeng Yan, Di Jin

Abstract

Relevance has a significant impact on user experience and business profit for e-commerce search platforms. In this work, we propose a data-driven framework for search relevance prediction, by distilling knowledge from BERT and related multi-layer Transformer teacher models into simple feed-forward networks with a large amount of unlabeled data. The distillation process produces a student model that recovers more than 97% of the teacher models' test accuracy on new queries, at a serving cost that is several orders of magnitude lower (latency 150x lower than BERT-Base and 15x lower than the most efficient BERT variant, TinyBERT). The application of temperature rescaling and teacher model stacking further boosts model accuracy without increasing the student model complexity. We present experimental results on both in-house e-commerce search relevance data and a public data set on sentiment analysis from the GLUE benchmark. The latter takes advantage of another related public data set of much larger scale, while disregarding its potentially noisy labels. Embedding analysis and a case study on the in-house data further highlight the strength of the resulting model. By making the data processing and model training source code public, we hope the techniques presented here can help reduce the energy consumption of state-of-the-art Transformer models and also level the playing field for small organizations lacking access to cutting-edge machine learning hardware.
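As a rough illustration of the temperature rescaling mentioned in the abstract, the sketch below computes a soft-label distillation loss in plain Python. It is a minimal sketch of the generic technique, not the authors' released code: the function names, the default temperature of 2.0, and the choice of cross-entropy against teacher soft labels are all assumptions made for illustration.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Softmax over logits / temperature (hypothetical helper).

    A larger temperature flattens the distribution, exposing more of the
    teacher's relative preferences among the classes.
    """
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """Cross-entropy of the student's softened prediction against the
    teacher's softened prediction (assumed loss form, for illustration)."""
    p_teacher = softmax_with_temperature(teacher_logits, temperature)
    p_student = softmax_with_temperature(student_logits, temperature)
    return -sum(pt * math.log(ps) for pt, ps in zip(p_teacher, p_student))


# Example: the teacher scores an unlabeled query-item pair; no human label is needed.
teacher = [2.0, 0.0]   # hypothetical logits for (relevant, not relevant)
student = [0.5, 0.3]   # hypothetical logits from the feed-forward student
loss = distillation_loss(teacher, student)
```

Because the target comes from the teacher's output rather than from a human label, a loss of this shape can be evaluated on arbitrarily large unlabeled query logs, which is the setting the abstract describes.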

 
