LSCP: Enhanced Large Scale Colloquial Persian Language Understanding

Abstract

Language recognition has advanced significantly in recent years by means of modern machine learning methods such as deep learning and benchmarks with rich annotations. However, research is still limited in low-resource formal languages. This leaves a significant gap in describing colloquial language, especially for low-resource languages such as Persian. To address this gap for low-resource languages, we propose a "Large Scale Colloquial Persian Dataset" (LSCP). LSCP is hierarchically organized in a semantic taxonomy that focuses on multi-task informal Persian language understanding as a comprehensive problem. This encompasses the recognition of multiple semantic aspects in human-level sentences, which are captured naturally from real-world sentences. We believe that further investigation and processing, as well as the application of novel algorithms and methods, can strengthen and enrich computerized understanding and processing of low-resource languages. The proposed corpus consists of 120M sentences derived from 27M tweets, annotated with parsing trees, part-of-speech tags, sentiment polarity, and translations in five different languages.
