Normalizing Text using Language Modelling based on Phonetics and String Similarity

Abstract

Social media networks and chatting platforms often use an informal version of natural text. Adversarial spelling attacks also alter the input text by modifying its characters. Normalizing such text is an essential step for applications like language translation and text-to-speech synthesis, where the models are trained on clean, regular English. We propose a new robust model to perform text normalization. Our system uses the BERT language model to predict the masked words that correspond to the unnormalized words. We propose two masking strategies that try to replace the unnormalized words in the text with their root form, using a score based on phonetic and string similarity metrics. We use human-centric evaluations in which volunteers were asked to rank the normalized text. Our strategies yield accuracies of 86.7% and 83.2%, respectively, which indicates the effectiveness of our system in dealing with text normalization.
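
The abstract does not give the exact form of the similarity score, so the following is only a minimal sketch of how candidate replacements for a noisy word might be ranked by combining a phonetic measure with a string measure. Everything here is an assumption made for illustration: a simplified Soundex code stands in for the phonetic metric, the `difflib.SequenceMatcher` ratio stands in for the string metric, the weight `alpha` is hypothetical, and the candidate list is a hand-written stand-in for the words a masked language model such as BERT would predict.

```python
from difflib import SequenceMatcher

# Soundex digit for each consonant (vowels, h, w, y carry no digit).
_CODES = {
    ch: digit
    for digit, letters in {
        "1": "bfpv", "2": "cgjkqsxz", "3": "dt",
        "4": "l", "5": "mn", "6": "r",
    }.items()
    for ch in letters
}


def soundex(word: str) -> str:
    """Simplified Soundex: first letter plus three digits, zero-padded."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    out = letters[0].upper()
    prev = _CODES.get(letters[0], "")
    for c in letters[1:]:
        code = _CODES.get(c, "")
        if code and code != prev:
            out += code
        # h and w do not separate repeated codes; vowels do.
        if c not in "hw":
            prev = code
    return (out + "000")[:4]


def similarity_score(noisy: str, candidate: str, alpha: float = 0.5) -> float:
    """Hypothetical score: weighted mix of phonetic and string similarity."""
    phonetic = SequenceMatcher(None, soundex(noisy), soundex(candidate)).ratio()
    string = SequenceMatcher(None, noisy.lower(), candidate.lower()).ratio()
    return alpha * phonetic + (1 - alpha) * string


def best_candidate(noisy: str, candidates: list[str], alpha: float = 0.5) -> str:
    """Pick the candidate with the highest combined score."""
    return max(candidates, key=lambda c: similarity_score(noisy, c, alpha))


if __name__ == "__main__":
    # Stand-in for language-model predictions at the masked position.
    predictions = ["tomorrow", "today", "morning"]
    print(best_candidate("2morrow", predictions))
```

In this sketch the language model supplies context-appropriate candidates and the score only selects among them, so a candidate that sounds or looks like the noisy token is preferred over one that merely fits the sentence.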
