Subword Mapping and Anchoring across Languages

Abstract

State-of-the-art multilingual systems rely on shared vocabularies that sufficiently cover all considered languages. To this end, a simple and frequently used approach makes use of subword vocabularies constructed jointly over several languages. We hypothesize that such vocabularies are suboptimal due to false positives (identical subwords with different meanings across languages) and false negatives (different subwords with similar meanings). To address these issues, we propose Subword Mapping and Anchoring across Languages (SMALA), a method to construct bilingual subword vocabularies. SMALA extracts subword alignments using an unsupervised state-of-the-art mapping technique and uses them to create cross-lingual anchors based on subword similarities. We demonstrate the benefits of SMALA for cross-lingual natural language inference (XNLI), where it improves zero-shot transfer to an unseen language without task-specific data, but only by sharing subword embeddings. Moreover, in neural machine translation, we show that joint subword vocabularies obtained with SMALA lead to higher BLEU scores on sentences that contain many false positives and false negatives.
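
To make the anchoring idea concrete, the following is a minimal sketch of how anchors might be selected once subword embeddings of two languages have been mapped into a shared space. The abstract does not state the selection rule, so the mutual-nearest-neighbour criterion, the `threshold` value, the function names, and the toy vectors are all assumptions made for illustration and should not be read as the exact SMALA procedure.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest(query_vec, candidates):
    """Name of the candidate whose vector is most similar to query_vec."""
    return max(candidates, key=lambda name: cosine(query_vec, candidates[name]))


def find_anchors(src_emb, tgt_emb, threshold=0.9):
    """Pair subwords that are mutual nearest neighbours with similarity
    of at least `threshold` (an assumed rule, not taken from the paper)."""
    anchors = {}
    for s, s_vec in src_emb.items():
        t = nearest(s_vec, tgt_emb)
        is_mutual = nearest(tgt_emb[t], src_emb) == s
        if is_mutual and cosine(s_vec, tgt_emb[t]) >= threshold:
            anchors[s] = t
    return anchors


# Toy vectors, invented for illustration and assumed to be already
# mapped into one shared space by some unsupervised mapping method.
en = {"cat": [1.0, 0.0], "gift": [0.0, 1.0]}
de = {"katze": [0.98, 0.05], "gift": [-1.0, 0.1], "geschenk": [0.05, 0.99]}

anchors = find_anchors(en, de)
print(anchors)
```

In this toy setup, English "gift" is anchored to German "geschenk" rather than to the identically spelled German "gift" (poison), which illustrates a false positive being avoided, while "cat" and "katze" illustrate a false negative being recovered.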
