Co-occurrences using Fasttext embeddings for word similarity tasks in Urdu

Abstract

Urdu is a widely spoken language in South Asia. Although a large body of literature exists in Urdu, the available data is still not enough to process the language with NLP techniques. Very efficient language models exist for English, a high-resource language, but Urdu and other under-resourced languages have long been neglected. To create efficient language models for these languages, we must first have good word embedding models. For Urdu, the only word embeddings we can find were trained and developed using the skip-gram model. In this paper, we build a corpus for Urdu by scraping and integrating data from various sources, and we compile a vocabulary for the Urdu language. We also modify fastText embeddings and n-gram models so that they can be trained on this corpus. We use the trained embeddings for a word similarity task and compare the results with existing techniques.
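To make the word similarity setting concrete, the following is a minimal sketch of the general idea behind fastText-style subword embeddings: a word is split into character n-grams, the n-gram vectors are averaged into a word vector, and two words are compared by cosine similarity. It is not the method used in this paper. The function names, the vector dimension, and the pseudo-random `ngram_vector` (a stand-in for trained n-gram embeddings) are all assumptions made for illustration, so the similarity values it produces carry no linguistic meaning.

```python
import hashlib
import math
import random


def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of `word`, wrapped in < and > boundary markers."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams


def ngram_vector(gram, dim=16):
    """Hypothetical stand-in for a trained n-gram embedding.

    A real model learns these vectors; here each n-gram simply gets a
    fixed pseudo-random vector derived from a hash of its text.
    """
    seed = int(hashlib.md5(gram.encode("utf-8")).hexdigest(), 16)
    rng = random.Random(seed)
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]


def word_vector(word, dim=16):
    """Average the n-gram vectors of `word` into a single word vector."""
    grams = char_ngrams(word)
    vec = [0.0] * dim
    if not grams:
        return vec
    for gram in grams:
        for k, value in enumerate(ngram_vector(gram, dim)):
            vec[k] += value
    return [value / len(grams) for value in vec]


def cosine_similarity(u, v):
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


if __name__ == "__main__":
    # Two Urdu words that share the subword "کتاب" (book / books).
    print(char_ngrams("کتاب", 3, 3))
    print(cosine_similarity(word_vector("کتاب"), word_vector("کتابیں")))
```

Because related word forms share many n-grams, a subword model of this kind can produce vectors for rare or unseen words, which is one reason such models are often considered for morphologically rich, low-resource languages.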
