XNLI 2.0: Improving XNLI dataset and performance on Cross Lingual Understanding (XLU)

  • 2023-01-16 17:24:57
  • Ankit Kumar Upadhyay, Harsit Kumar Upadhya


Natural Language Processing systems depend heavily on the availability of annotated data to train practical models. Primarily, models are trained on English datasets. In recent times, significant advances have been made in multilingual understanding due to the steeply increasing necessity of working in different languages. One point that stands out is that, since so many pre-trained multilingual models are now available, we can utilize them for cross-lingual understanding tasks. Using cross-lingual understanding and Natural Language Inference, it is possible to train models whose applications extend beyond the training language. We can leverage the power of machine translation to skip the tiresome part of translating datasets from one language to another. In this work, we focus on improving the original XNLI dataset by re-translating the MNLI dataset into all 14 non-English languages present in XNLI, including the test and dev sets of XNLI, using Google Translate. We also perform experiments by training models in all 15 languages and analyzing their performance on the task of natural language inference. We then expand our boundary to investigate whether we could improve performance in low-resource languages such as Swahili and Urdu by training models in languages other than English.


