Data and Representation for Turkish Natural Language Inference

Abstract

Large annotated datasets in NLP are overwhelmingly in English. This is an obstacle to progress in other languages. Unfortunately, obtaining new annotated resources for each task in each language would be prohibitively expensive. At the same time, commercial machine translation systems are now robust. Can we leverage these systems to translate English-language datasets automatically? In this paper, we offer a positive response for natural language inference (NLI) in Turkish. We translated two large English NLI datasets into Turkish and had a team of experts validate their translation quality and fidelity to the original labels. Using these datasets, we address core issues of representation for Turkish NLI. We find that in-language embeddings are essential and that morphological parsing can be avoided where the training set is large. Finally, we show that models trained on our machine-translated datasets are successful on human-translated evaluation sets. We share all code, models, and data publicly.
