MasakhaPOS: Part-of-Speech Tagging for Typologically Diverse African Languages

  • 2023-05-23 13:15:33
  • Cheikh M. Bamba Dione, David Adelani, Peter Nabende, Jesujoba Alabi, Thapelo Sindane, Happy Buzaaba, Shamsuddeen Hassan Muhammad, Chris Chinenye Emezue, Perez Ogayo, Anuoluwapo Aremu, Catherine Gitau, Derguene Mbaye, Jonathan Mukiibi, Blessing Sibanda, Bonaventure F. P. Dossou, Andiswa Bukula, Rooweither Mabuya, Allahsera Auguste Tapo, Edwin Munkoh-Buabeng, Victoire Memdjokam Koagne, Fatoumata Ouoba Kabore, Amelia Taylor, Godson Kalipe, Tebogo Macucwa, Vukosi Marivate, Tajuddeen Gwadabe, Mboning Tchiaze Elvis, Ikechukwu Onyenwe, Gratien Atindogbe, Tolulope Adelani, Idris Akinade, Olanrewaju Samuel, Marien Nahimana, Théogène Musabeyezu, Emile Niyomutabazi, Ester Chimhenga, Kudzai Gotosa, Patrick Mizha, Apelete Agbolo, Seydou Traore, Chinedu Uchechukwu, Aliyu Yusuf, Muhammad Abdullahi, Dietr

Abstract

In this paper, we present MasakhaPOS, the largest part-of-speech (POS) dataset for 20 typologically diverse African languages. We discuss the challenges in annotating POS for these languages using the Universal Dependencies (UD) guidelines. We conducted extensive POS baseline experiments using conditional random fields and several multilingual pre-trained language models. We applied various cross-lingual transfer models trained with data available in UD. Evaluating on the MasakhaPOS dataset, we show that choosing the best transfer language(s) in both single-source and multi-source setups greatly improves the POS tagging performance of the target languages, in particular when combined with cross-lingual parameter-efficient fine-tuning methods. Crucially, transferring knowledge from a language that matches the language family and morphosyntactic properties seems more effective for POS tagging in unseen languages.

 
