Improving Multilingual Semantic Textual Similarity with Shared Sentence Encoder for Low-resource Languages

Abstract

Measuring the semantic similarity between two sentences (Semantic Textual Similarity, STS) is fundamental to many NLP applications. Despite remarkable results in supervised settings with adequate labeling, little attention has been paid to this task in low-resource languages with insufficient labeling. Existing approaches mostly leverage machine translation (MT) techniques to translate sentences into a rich-resource language. These approaches either introduce language biases or are impractical in industrial applications, where spoken-language scenarios are more common and rigorous efficiency is required. In this work, we propose a multilingual framework to tackle the STS task in a low-resource language, e.g. Spanish, Arabic, Indonesian and Thai, by utilizing the rich annotation data in a rich-resource language, e.g. English. Our approach extends a basic monolingual STS framework with a shared multilingual encoder pretrained on a translation task, so as to incorporate rich-resource language data. By exploiting the nature of a shared multilingual encoder, one sentence can have multiple representations, one for each target translation language, which are used in an ensemble model to improve similarity evaluation. We demonstrate the superiority of our method over other state-of-the-art approaches on the SemEval STS task, where it significantly improves on non-MT methods, as well as on an online industrial product, where the MT method fails to beat the baseline while our approach still yields consistent improvements.
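
The ensemble idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how the per-language representations are combined, so plain averaging of cosine similarities is an assumption made here, and `toy_encode`, `ensemble_similarity` and `DIM` are hypothetical names. The toy encoder is only a deterministic stand-in for a shared multilingual encoder conditioned on a target translation language.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

Vector = List[float]
# Hypothetical encoder interface: (sentence, target language tag) -> sentence vector.
Encoder = Callable[[str, str], Vector]

DIM = 16  # size of the toy sentence vector (assumption, illustration only)


def toy_encode(sentence: str, target_lang: str) -> Vector:
    """Stand-in for a shared encoder: a bag-of-words count vector whose
    bucket assignment depends on the target language tag."""
    multiplier = 1 + sum(ord(ch) for ch in target_lang) % 7
    vec = [0.0] * DIM
    for word in sentence.lower().split():
        bucket = (sum(ord(ch) for ch in word) * multiplier) % DIM
        vec[bucket] += 1.0
    return vec


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def ensemble_similarity(
    encode: Encoder,
    sentence_a: str,
    sentence_b: str,
    target_langs: Sequence[str],
) -> Tuple[float, Dict[str, float]]:
    """Score a sentence pair once per target language, then average.

    Averaging is an assumed combination rule, chosen for simplicity.
    """
    if not target_langs:
        raise ValueError("at least one target language is required")
    per_lang = {
        lang: cosine(encode(sentence_a, lang), encode(sentence_b, lang))
        for lang in target_langs
    }
    return sum(per_lang.values()) / len(per_lang), per_lang


if __name__ == "__main__":
    score, details = ensemble_similarity(
        toy_encode, "a man is playing a guitar", "a man plays the guitar", ["es", "th"]
    )
    print(score, details)
```

In a real system the `encode` argument would be replaced by the trained shared encoder, and the averaging step could be swapped for a learned combination over the per-language scores.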
