Attention based Bidirectional GRU hybrid model for inappropriate content detection in Urdu language

Abstract

With the increased use of the internet and social networks for online discussions, the spread of toxic and inappropriate content on social networking sites has also increased. Several studies have addressed this problem in different languages. However, little work has been done on identifying inappropriate content in South Asian languages using deep learning techniques. In Urdu, spellings are not standardized: people commonly write the same word with different spellings, and mixing in other languages such as English makes the text even more challenging to process. Limited research is available on processing such text with the best available algorithms. Adding an attention layer to a deep learning model can help it handle long-term dependencies and improve its efficiency. To explore the effects of the attention layer, this study proposes an attention-based Bidirectional GRU hybrid model (BiGRU-A) for identifying inappropriate content in Urdu Unicode text. Four baseline deep learning models, LSTM, Bi-LSTM, GRU, and TCN, are used for comparison with the proposed model. The models were compared on evaluation metrics, dataset size, and the impact of the word embedding layer. Pre-trained Urdu word2vec embeddings were used for this purpose. The proposed BiGRU-A model outperformed all baseline models, yielding 84% accuracy without the pre-trained word2vec layer. Our experiments show that the attention layer improves the model's efficiency and that pre-trained word2vec embeddings do not work well with an inappropriate content dataset.
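
To make the role of the attention layer concrete, the following is a minimal sketch of attention pooling over a sequence of recurrent hidden states (for example, the per-token outputs of a bidirectional GRU). The abstract does not specify the attention formulation used in BiGRU-A, so additive (tanh-based) scoring is an assumption here, and the names `attention_pool`, `w`, and `v` as well as all numeric values are hypothetical and chosen only for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(hidden_states, w, v):
    """Additive attention pooling (an assumed formulation, not the paper's).

    hidden_states: list of T vectors of size d (e.g. BiGRU outputs per token)
    w:             k x d projection matrix (hypothetical learned weights)
    v:             vector of size k (hypothetical learned weights)
    Returns (context, weights): a size-d weighted sum of the hidden states
    and the T attention weights, which sum to 1.
    """
    scores = []
    for h in hidden_states:
        # Project the state, squash with tanh, then score against v.
        proj = [math.tanh(sum(w_ij * h_j for w_ij, h_j in zip(row, h)))
                for row in w]
        scores.append(sum(v_i * p_i for v_i, p_i in zip(v, proj)))
    weights = softmax(scores)
    dim = len(hidden_states[0])
    # Context vector: attention-weighted sum over all time steps.
    context = [sum(a * h[j] for a, h in zip(weights, hidden_states))
               for j in range(dim)]
    return context, weights


# Toy example: 3 time steps, hidden size 4, projection size 2.
toy_states = [
    [0.1, -0.2, 0.3, 0.0],
    [0.5, 0.4, -0.1, 0.2],
    [-0.3, 0.1, 0.2, 0.6],
]
toy_w = [
    [0.2, -0.1, 0.4, 0.3],
    [-0.5, 0.2, 0.1, 0.0],
]
toy_v = [1.0, -1.0]

toy_context, toy_weights = attention_pool(toy_states, toy_w, toy_v)
```

Because the context vector draws on every time step directly, a classifier placed on top of it does not have to rely only on the final recurrent state, which is one common argument for why attention helps with long-range dependencies. In a trained model, `w` and `v` would be learned jointly with the recurrent layers rather than fixed by hand.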
