Exploiting Cross-Lingual Subword Similarities in Low-Resource Document Classification

Abstract

Text classification must sometimes be applied in a low-resource language with no labeled training data. However, training data may be available in a related language. We investigate whether character-level knowledge transfer from a related language helps text classification. We present a cross-lingual document classification framework (CACO) that exploits cross-lingual subword similarity by jointly training a character-based embedder and a word-based classifier. The embedder derives vector representations for input words from their written forms, and the classifier makes predictions based on the word vectors. We use a joint character representation for both the source language and the target language, which allows the embedder to generalize knowledge about source language words to target language words with similar forms. We propose a multi-task objective that can further improve the model if additional cross-lingual or monolingual resources are available. Experiments confirm that character-level knowledge transfer is more data-efficient than word-level transfer between related languages.
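
The two-part design described above (an embedder that sees only a word's spelling, feeding a classifier that sees only word vectors) can be illustrated with a minimal forward-pass sketch. It is not the CACO implementation: the abstract does not specify the embedder architecture, so averaged character trigram vectors are swapped in here as a simple stand-in, and the names `char_ngrams`, `CharEmbedder`, and `WordClassifier`, the dimensions, and the example words are all hypothetical. Weights are random and no training is performed.

```python
import math
import random


def char_ngrams(word, n=3):
    """Split a word into character n-grams, with boundary markers < and >."""
    padded = f"<{word}>"
    count = max(1, len(padded) - n + 1)  # very short words yield one short n-gram
    return [padded[i:i + n] for i in range(count)]


class CharEmbedder:
    """Hypothetical stand-in embedder: a word vector is the mean of its trigram vectors.

    One trigram table is shared by the source and target language, which is what
    lets similarly spelled words in the two languages receive similar vectors.
    """

    def __init__(self, dim=32, seed=0):
        self.dim = dim
        self._rng = random.Random(seed)
        self._table = {}  # trigram -> vector, shared across languages

    def _ngram_vector(self, gram):
        # Assumption: vectors are randomly initialised; a real model would learn them.
        if gram not in self._table:
            self._table[gram] = [self._rng.gauss(0.0, 1.0) for _ in range(self.dim)]
        return self._table[gram]

    def embed(self, word):
        vectors = [self._ngram_vector(g) for g in char_ngrams(word)]
        return [sum(column) / len(vectors) for column in zip(*vectors)]


class WordClassifier:
    """Hypothetical classifier: average the word vectors, then linear layer + softmax."""

    def __init__(self, dim, num_labels, seed=1):
        rng = random.Random(seed)
        self.weights = [[rng.gauss(0.0, 0.1) for _ in range(dim)]
                        for _ in range(num_labels)]

    def predict_proba(self, word_vectors):
        doc = [sum(column) / len(word_vectors) for column in zip(*word_vectors)]
        scores = [sum(w * x for w, x in zip(row, doc)) for row in self.weights]
        top = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        return [e / total for e in exps]


embedder = CharEmbedder(dim=32)
classifier = WordClassifier(dim=32, num_labels=2)

source_doc = ["problema", "nazionale"]  # illustrative source-language words
target_doc = ["problema", "nacional"]   # illustrative target-language words

# Trigrams that the two related word forms have in common.
shared = set(char_ngrams("nazionale")) & set(char_ngrams("nacional"))

source_probs = classifier.predict_proba([embedder.embed(w) for w in source_doc])
target_probs = classifier.predict_proba([embedder.embed(w) for w in target_doc])
```

Because both documents pass through the same trigram table, whatever the embedder learns about a source-language trigram applies to any target-language word containing it. In a trained version of this sketch, the trigram table and the classifier weights would be updated jointly from labeled source-language documents.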
