PESTO: Switching Point based Dynamic and Relative Positional Encoding for Code-Mixed Languages

Abstract

NLP applications for code-mixed (CM) or mix-lingual text have gained significant momentum recently, mainly because language mixing is prevalent in social media communication in multilingual societies such as India, Mexico, Europe, and parts of the USA. Word embeddings are basic building blocks of any NLP system today, yet word embedding for CM languages is unexplored territory. The major bottleneck for CM word embeddings is switching points, where the language switches. These locations lack context, and statistical systems fail to model this phenomenon because of high variance in the seen examples. In this paper we present our initial observations on applying switching point based positional encoding techniques to CM language, specifically Hinglish (Hindi-English). Results are only marginally better than SOTA, but it is evident that positional encoding could be an effective way to train position-sensitive language models for CM text.
