Tokenization Matters: Improving Zero-Shot NER for Indic Languages

Abstract

Tokenization is a critical component of Natural Language Processing (NLP), especially for low-resource languages, where subword segmentation influences vocabulary structure and downstream task accuracy. Although Byte Pair Encoding (BPE) is a standard tokenization method in multilingual language models, its suitability for Named Entity Recognition (NER) in low-resource Indic languages remains underexplored, given its limitations in handling morphological complexity. In this work, we systematically compare BPE, SentencePiece, and character-level tokenization strategies using IndicBERT for NER in low-resource Indic languages (Assamese, Bengali, Marathi, and Odia) and in extremely low-resource Indic languages (Santali, Manipuri, and Sindhi). We assess both intrinsic linguistic properties (tokenization efficiency, out-of-vocabulary (OOV) rates, and morphological preservation) and extrinsic downstream performance, including fine-tuning and zero-shot cross-lingual transfer. Our experiments show that SentencePiece consistently outperforms BPE for NER in low-resource Indic languages, particularly in zero-shot cross-lingual settings, because it better preserves entity consistency. While BPE produces the most compact tokenization, it generalizes poorly: it misclassifies or fails to recognize entity labels when tested on unseen languages. In contrast, SentencePiece better preserves linguistic structure, which benefits entity recognition in extremely low-resource and morphologically rich Indic languages such as Santali and Manipuri, and it generalizes well across scripts, as with Sindhi, which is written in the Arabic script. The results point to SentencePiece as the more effective tokenization strategy for NER in multilingual and low-resource Indic NLP applications.
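Two of the intrinsic measures named above, tokenization efficiency and OOV rate, can be illustrated with a short sketch. This is a minimal illustration under stated assumptions, not the evaluation code used in this work: the function names `fertility` and `oov_rate`, the reading of efficiency as average tokens per word, and the toy vocabulary are all hypothetical, and a character-level tokenizer stands in for any tokenizer under comparison.

```python
from typing import Callable, List, Sequence, Set

# Assumption: a tokenizer is any function mapping one word to a list of tokens.
Tokenizer = Callable[[str], List[str]]


def char_tokenize(word: str) -> List[str]:
    """Character-level tokenization: one token per Unicode code point."""
    return list(word)


def fertility(words: Sequence[str], tokenize: Tokenizer) -> float:
    """Average number of tokens per word (lower means more compact)."""
    if not words:
        return 0.0
    total_tokens = sum(len(tokenize(w)) for w in words)
    return total_tokens / len(words)


def oov_rate(words: Sequence[str], tokenize: Tokenizer, vocab: Set[str]) -> float:
    """Fraction of produced tokens that are missing from the vocabulary."""
    tokens = [t for w in words for t in tokenize(w)]
    if not tokens:
        return 0.0
    missing = sum(1 for t in tokens if t not in vocab)
    return missing / len(tokens)


if __name__ == "__main__":
    # Hypothetical toy data, for illustration only.
    words = ["Dhaka", "Padma"]
    vocab = set("Dhaka")  # characters seen in "training"
    print("fertility:", fertility(words, char_tokenize))
    print("OOV rate:", oov_rate(words, char_tokenize, vocab))
```

Swapping `char_tokenize` for a BPE or SentencePiece tokenizer's word-to-pieces function would allow the same two measures to be compared across strategies on a shared word list.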
