A Monolingual Approach to Contextualized Word Embeddings for Mid-Resource Languages

Abstract

We use the multilingual OSCAR corpus, extracted from Common Crawl vialanguage classification, filtering and cleaning, to train monolingualcontextualized word embeddings (ELMo) for five mid-resource languages. We thencompare the performance of OSCAR-based and Wikipedia-based ELMo embeddings forthese languages on the part-of-speech tagging and parsing tasks. We show that,despite the noise in the Common-Crawl-based OSCAR data, embeddings trained onOSCAR perform much better than monolingual embeddings trained on Wikipedia.They actually equal or improve the current state of the art in tagging andparsing for all five languages. In particular, they also improve overmultilingual Wikipedia-based contextual embeddings (multilingual BERT), whichalmost always constitutes the previous state of the art, thereby showing thatthe benefit of a larger, more diverse corpus surpasses the cross-lingualbenefit of multilingual embedding architectures.

Quick Read (beta)

loading the full paper ...