Code-Switched Language Models Using Neural Based Synthetic Data from Parallel Sentences

Abstract

Training code-switched language models is difficult due to the lack of data and the complexity of the grammatical structure. Linguistic constraint theories have been used for decades to generate artificial code-switching sentences to cope with this issue. However, these approaches require external word alignments or constituency parsers, which produce erroneous results on distant languages. We propose a sequence-to-sequence model using a copy mechanism to generate code-switching data by leveraging parallel monolingual translations from a limited source of code-switching data. The model learns how to combine words from parallel sentences and identifies when to switch from one language to the other. Moreover, it captures code-switching constraints by attending to and aligning the words in the inputs, without requiring any external knowledge. Based on experimental results, the language model trained with the generated sentences achieves state-of-the-art performance and improves end-to-end automatic speech recognition.
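To make the copy mechanism concrete, the sketch below shows one decoding step of a generic pointer-generator style mixture, in which the output distribution blends generating a word from the vocabulary with copying a word from the concatenated parallel input. This is an illustration under assumptions, not the model's actual implementation: the function name `copy_distribution`, the gate value `p_gen`, the English/Spanish tokens, and all probabilities are hypothetical.

```python
def copy_distribution(p_gen, vocab_probs, attention, source_tokens):
    """Blend a generation distribution with a copy distribution.

    p_gen         -- probability of generating from the vocabulary (assumed gate)
    vocab_probs   -- dict mapping vocabulary words to generation probabilities
    attention     -- attention weights over source positions (sum to 1)
    source_tokens -- input tokens, here two parallel sentences concatenated
    """
    # Generation part: scale every vocabulary probability by the gate.
    final = {word: p_gen * p for word, p in vocab_probs.items()}
    # Copy part: add the attention mass of each source position to its token,
    # so out-of-vocabulary source words can still be produced.
    for token, weight in zip(source_tokens, attention):
        final[token] = final.get(token, 0.0) + (1.0 - p_gen) * weight
    return final


# Toy input: an English sentence followed by a made-up Spanish fragment.
source_tokens = ["i", "want", "to", "eat", "quiero", "comer"]
attention = [0.05, 0.05, 0.05, 0.05, 0.2, 0.6]
vocab_probs = {"i": 0.1, "want": 0.2, "eat": 0.6, "<unk>": 0.1}

dist = copy_distribution(0.3, vocab_probs, attention, source_tokens)
next_word = max(dist, key=dist.get)
```

With a low `p_gen` and attention concentrated on a word from the other language, the most probable next word is copied from the input, which is how a switch point can emerge without an external word aligner.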
