FarsTail: A Persian Natural Language Inference Dataset

Abstract

Natural language inference (NLI) is one of the central tasks in natural language processing (NLP) and encapsulates many fundamental aspects of language understanding. With the considerable achievements of data-hungry deep learning methods in NLP tasks, a great deal of effort has been devoted to developing more diverse datasets for different languages. In this paper, we present a new dataset for the NLI task in the Persian language, also known as Farsi, which is one of the dominant languages in the Middle East. This dataset, named FarsTail, includes 10,367 samples, which are provided both in the Persian language and in an indexed format to be useful for non-Persian researchers. The samples are generated from 3,539 multiple-choice questions with minimal annotator intervention, in a way similar to the SciTail dataset. A carefully designed multi-step process is adopted to ensure the quality of the dataset. We also present the results of traditional and state-of-the-art methods on FarsTail, including different embedding methods such as word2vec, fastText, ELMo, BERT, and LASER, as well as different modeling approaches such as DecompAtt, ESIM, HBMP, ULMFiT, and a cross-lingual transfer approach, to provide a solid baseline for future research. The best test accuracy obtained is 78.13%, which shows that there is considerable room for improving current methods before they are useful for real-world NLP applications in different languages. The dataset is available at https://github.com/dml-qom/FarsTail.
