Learning Word Vectors for 157 Languages

  • 2018-02-19 22:32:47
  • Edouard Grave, Piotr Bojanowski, Prakhar Gupta, Armand Joulin, Tomas Mikolov
  • 51

Abstract

Distributed word representations, or word vectors, have recently been applied to many tasks in natural language processing, leading to state-of-the-art performance. A key ingredient to the successful application of these representations is to train them on very large corpora, and use these pre-trained models in downstream tasks. In this paper, we describe how we trained such high-quality word representations for 157 languages. We used two sources of data to train these models: the free online encyclopedia Wikipedia and data from the Common Crawl project. We also introduce three new word analogy datasets to evaluate these word vectors, for French, Hindi and Polish. Finally, we evaluate our pre-trained word vectors on 10 languages for which evaluation datasets exist, showing very strong performance compared to previous models.

 
