SenWiCh: Sense-Annotation of Low-Resource Languages for WiC using Hybrid Methods

  • 2025-07-22 17:15:48
  • Roksana Goworek, Harpal Karlcut, Muhammad Shezad, Nijaguna Darshana, Abhishek Mane, Syam Bondada, Raghav Sikka, Ulvi Mammadov, Rauf Allahverdiyev, Sriram Purighella, Paridhi Gupta, Muhinyia Ndegwa, Haim Dubossarsky

Abstract

This paper addresses the critical need for high-quality evaluation datasets in low-resource languages to advance cross-lingual transfer. While cross-lingual transfer offers a key strategy for leveraging multilingual pretraining to expand language technologies to understudied and typologically diverse languages, its effectiveness is dependent on quality and suitable benchmarks. We release new sense-annotated datasets of sentences containing polysemous words, spanning ten low-resource languages across diverse language families and scripts. To facilitate dataset creation, the paper presents a demonstrably beneficial semi-automatic annotation method. The utility of the datasets is demonstrated through Word-in-Context (WiC) formatted experiments that evaluate transfer on these low-resource languages. Results highlight the importance of targeted dataset creation and evaluation for effective polysemy disambiguation in low-resource settings and transfer studies. The released datasets and code aim to support further research into fair, robust, and truly multilingual NLP.

 
