Subword-Level Language Identification for Intra-Word Code-Switching

Abstract

Language identification for code-switching (CS), the phenomenon of alternating between two or more languages in conversations, has traditionally been approached under the assumption of a single language per token. However, if at least one language is morphologically rich, a large number of words can be composed of morphemes from more than one language (intra-word CS). In this paper, we extend the language identification task to the subword-level, such that it includes splitting mixed words while tagging each part with a language ID. We further propose a model for this task, which is based on a segmental recurrent neural network. In experiments on a new Spanish–Wixarika dataset and on an adapted German–Turkish dataset, our proposed model performs slightly better than or roughly on par with our best baseline, respectively. Considering only mixed words, however, it strongly outperforms all baselines.
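
To make the task format concrete, the following is a minimal sketch of subword-level language identification as joint segmentation and tagging: a word is split into contiguous pieces and each piece receives a language ID. It is not the segmental recurrent neural network proposed in the paper. The dynamic program over segment boundaries only illustrates the general segment-level decoding idea, and everything named here is an assumption made for illustration: the function names `best_segmentation` and `toy_score`, the tag labels `DE` and `TR`, the tiny lexicon, and the token `handyim`, which is invented and not taken from either dataset.

```python
from typing import Callable, Dict, List, Set, Tuple

Segment = Tuple[str, str]  # (substring, language tag)

# Hypothetical toy lexicon; a real system would use a learned scorer instead.
TOY_LEXICON: Dict[str, Set[str]] = {"DE": {"handy"}, "TR": {"im"}}


def toy_score(piece: str, tag: str) -> float:
    """Reward pieces found in the toy lexicon, penalise everything else."""
    if piece in TOY_LEXICON.get(tag, set()):
        return float(len(piece) ** 2)
    return -float(len(piece) + 1)


def best_segmentation(
    word: str,
    tags: List[str],
    score: Callable[[str, str], float],
) -> List[Segment]:
    """Return the highest-scoring split of `word` into tagged segments.

    best[i] holds the best (score, segments) pair covering word[:i].
    """
    n = len(word)
    best: List[Tuple[float, List[Segment]]] = [(0.0, [])]
    best += [(float("-inf"), [])] * n
    for end in range(1, n + 1):
        for start in range(end):
            prev_score, prev_segments = best[start]
            if prev_score == float("-inf"):
                continue
            piece = word[start:end]
            for tag in tags:
                candidate = prev_score + score(piece, tag)
                if candidate > best[end][0]:
                    best[end] = (candidate, prev_segments + [(piece, tag)])
    return best[n][1]


if __name__ == "__main__":
    # A mixed (intra-word CS) token is split into two tagged parts.
    print(best_segmentation("handyim", ["DE", "TR"], toy_score))
    # A monolingual token stays a single segment.
    print(best_segmentation("handy", ["DE", "TR"], toy_score))
```

Under the traditional single-language-per-token assumption, the output for each word would be one tag; here the output is a list of (substring, tag) pairs, which reduces to a single pair for words that are not mixed.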
