Language-agnostic BERT Sentence Embedding

Abstract

We adapt multilingual BERT to produce language-agnostic sentence embeddingsfor 109 languages. %The state-of-the-art for numerous monolingual andmultilingual NLP tasks is masked language model (MLM) pretraining followed bytask specific fine-tuning. While English sentence embeddings have been obtainedby fine-tuning a pretrained BERT model, such models have not been applied tomultilingual sentence embeddings. Our model combines masked language model(MLM) and translation language model (TLM) pretraining with a translationranking task using bi-directional dual encoders. The resulting multilingualsentence embeddings improve average bi-text retrieval accuracy over 112languages to 83.7%, well above the 65.5% achieved by the prior state-of-the-arton Tatoeba. Our sentence embeddings also establish new state-of-the-art resultson BUCC and UN bi-text retrieval.

Quick Read (beta)

loading the full paper ...