SynDARin: Synthesising Datasets for Automated Reasoning in Low-Resource Languages

Abstract

Question Answering (QA) datasets have been instrumental in developing and evaluating Large Language Model (LLM) capabilities. However, such datasets are scarce for languages other than English due to the cost and difficulty of collection and manual annotation. This makes it challenging to produce novel models and to measure the performance of multilingual LLMs in low-resource languages. To mitigate this, we propose $\textbf{S}$yn$\textbf{DAR}$in, a method for generating and validating QA datasets for low-resource languages. We utilize parallel content mining to obtain $\textit{human-curated}$ paragraphs between English and the target language. We use the English data as context to $\textit{generate}$ synthetic multiple-choice (MC) question-answer pairs, which are automatically translated and further validated for quality. Combining these with their designated non-English $\textit{human-curated}$ paragraphs forms the final QA dataset. The method maintains content quality, reduces the likelihood of factual errors, and circumvents the need for costly annotation. To test the method, we created a QA dataset with $1.2$K samples for the Armenian language. The human evaluation shows that $98\%$ of the generated English data maintains quality and diversity in the question types and topics, while the translation validation pipeline can filter out $\sim70\%$ of data with poor quality. We use the dataset to benchmark state-of-the-art LLMs, showing their inability to achieve human accuracy, with some model performances closer to random chance. This shows that the generated dataset is non-trivial and can be used to evaluate reasoning capabilities in low-resource languages.
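
The data flow described above (generate MC questions from the English paragraph, translate them, validate, then pair the surviving items with the human-curated target-language paragraph) can be illustrated with a minimal sketch. This is not the authors' implementation: the names `MCQuestion`, `is_valid`, and `build_dataset` are hypothetical, the `generate` and `translate` callables stand in for whatever LLM and machine-translation components are used, and the checks in `is_valid` are simple placeholder heuristics, since the abstract does not specify the actual validation criteria.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List, Tuple


@dataclass
class MCQuestion:
    """A multiple-choice item (hypothetical structure, for illustration)."""
    question: str
    options: List[str]
    answer: str


def is_valid(item: MCQuestion) -> bool:
    """Placeholder quality checks on a translated item (assumed, not from the paper):
    non-empty question, non-empty and distinct options, answer among the options."""
    options = [o.strip() for o in item.options]
    return (
        bool(item.question.strip())
        and all(options)
        and len(set(options)) == len(options)
        and item.answer.strip() in options
    )


def build_dataset(
    parallel_paragraphs: Iterable[Tuple[str, str]],
    generate: Callable[[str], List[MCQuestion]],
    translate: Callable[[MCQuestion], MCQuestion],
) -> List[Dict[str, object]]:
    """Pair validated, translated questions with the target-language paragraph."""
    dataset = []
    for english_paragraph, target_paragraph in parallel_paragraphs:
        # Questions are generated from the English side only.
        for english_item in generate(english_paragraph):
            translated = translate(english_item)
            # Items that fail validation are dropped.
            if is_valid(translated):
                dataset.append(
                    {
                        "context": target_paragraph,  # human-curated, not translated
                        "question": translated.question,
                        "options": translated.options,
                        "answer": translated.answer,
                    }
                )
    return dataset
```

In this sketch the context paragraph is never machine-translated: only the question and its options pass through `translate`, which mirrors the abstract's point that the final dataset keeps the human-curated non-English paragraphs.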
