word2word: A Collection of Bilingual Lexicons for 3,564 Language Pairs

Abstract

We present word2word, a publicly available dataset and an open-source Python package for cross-lingual word translations extracted from sentence-level parallel corpora. Our dataset provides top-k word translations in 3,564 (directed) language pairs across 62 languages in OpenSubtitles2018 (Lison et al., 2018). To obtain this dataset, we use a count-based bilingual lexicon extraction model based on the observation that not only source and target words but also source words themselves can be highly correlated. We illustrate that the resulting bilingual lexicons have high coverage and attain competitive translation quality for several language pairs. We wrap our dataset and model in an easy-to-use Python library, which supports downloading and retrieving top-k word translations in any of the supported language pairs as well as computing top-k word translations for custom parallel corpora.
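
To make the idea of count-based lexicon extraction concrete, the following is a minimal sketch of a plain co-occurrence baseline. It is not the model described in the abstract: it ranks target words by the estimated conditional probability p(y | x) alone and does not correct for correlations among source words. The function name `top_k_translations`, the whitespace tokenization, and the toy corpus are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def top_k_translations(pairs, k=3):
    """Rank target words for each source word by co-occurrence.

    pairs: iterable of (source_sentence, target_sentence) strings.
    Returns a dict mapping each source word to its top-k target words.
    Hypothetical helper for illustration; a simple baseline only.
    """
    src_count = Counter()        # number of sentence pairs containing x
    cooc = defaultdict(Counter)  # number of sentence pairs containing x and y

    for src, tgt in pairs:
        # Assumption: lowercase whitespace tokenization, counted once per pair.
        src_words = set(src.lower().split())
        tgt_words = set(tgt.lower().split())
        for x in src_words:
            src_count[x] += 1
            for y in tgt_words:
                cooc[x][y] += 1

    lexicon = {}
    for x, counts in cooc.items():
        # Score each candidate by p(y | x) = count(x, y) / count(x);
        # break ties alphabetically so the output is deterministic.
        ranked = sorted(
            counts.items(),
            key=lambda item: (-item[1] / src_count[x], item[0]),
        )
        lexicon[x] = [y for y, _ in ranked[:k]]
    return lexicon


# Toy English-French parallel corpus (invented for this example).
toy_corpus = [
    ("the apple", "la pomme"),
    ("the car", "la voiture"),
    ("red apple", "pomme rouge"),
]
lexicon = top_k_translations(toy_corpus, k=2)
print(lexicon["apple"])  # ['pomme', 'la']
print(lexicon["car"])    # ['la', 'voiture']
```

In this toy corpus, "car" ties between "la" and "voiture" because "car" always appears next to "the". This is the kind of confound that arises when source words are correlated with each other, which the abstract names as the motivation for its model.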
