Automated Quality Control for Language Documentation: Detecting Phonotactic Inconsistencies in a Kokborok Wordlist

  • 2025-10-24 15:51:10
  • Kellen Parker van Dam, Abishek Stephen

Abstract

Lexical data collection in language documentation often contains transcription errors and undocumented borrowings that can mislead linguistic analysis. We present unsupervised anomaly detection methods to identify phonotactic inconsistencies in wordlists, applying them to a multilingual dataset of Kokborok varieties with Bangla. Using character-level and syllable-level phonotactic features, our algorithms identify potential transcription errors and borrowings. While precision and recall remain modest due to the subtle nature of these anomalies, syllable-aware features significantly outperform character-level baselines. The high-recall approach provides fieldworkers with a systematic method to flag entries requiring verification, supporting data quality improvement in low-resourced language documentation.
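
To make the idea of a character-level phonotactic baseline concrete, the following is a minimal sketch and not the authors' algorithm. It assumes a smoothed character-bigram model scored with leave-one-out surprisal, so that a word cannot support its own unusual sequences. The function names (`bigrams`, `score_words`, `flag_anomalies`), the smoothing constant `alpha`, and the toy strings are all hypothetical; the strings are invented for illustration and are not taken from the Kokborok or Bangla data.

```python
import math
from collections import Counter


def bigrams(word):
    """Return character bigrams of a word padded with '#' boundary marks."""
    padded = "#" + word + "#"
    return [padded[i:i + 2] for i in range(len(padded) - 1)]


def score_words(words, alpha=1.0):
    """Mean negative log-probability per bigram, estimated leave-one-out.

    Assumption: add-alpha smoothing over the symbols seen in the wordlist.
    Higher scores mean the word looks less like the rest of the list.
    """
    total_pairs = Counter(bg for w in words for bg in bigrams(w))
    total_firsts = Counter(bg[0] for w in words for bg in bigrams(w))
    n_symbols = len({ch for w in words for ch in w} | {"#"})

    scores = {}
    for w in words:
        own = bigrams(w)
        own_pairs = Counter(own)
        own_firsts = Counter(bg[0] for bg in own)
        nll = 0.0
        for bg in own:
            # Remove this word's own contribution before estimating.
            pair = total_pairs[bg] - own_pairs[bg]
            first = total_firsts[bg[0]] - own_firsts[bg[0]]
            nll -= math.log((pair + alpha) / (first + alpha * n_symbols))
        scores[w] = nll / len(own)
    return scores


def flag_anomalies(words, top_k=1):
    """Return the top_k highest-scoring words for manual verification."""
    scores = score_words(words)
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


# Invented toy strings (not real data); "zgx" breaks the pattern of the rest.
toy = ["tala", "lata", "tata", "lala", "tal", "lat", "atal", "alat", "zgx"]
print(flag_anomalies(toy, top_k=1))
```

A syllable-aware variant would replace the character bigrams with features computed over syllable constituents, which requires a syllabifier for the language in question and is not sketched here. Raising `top_k` trades precision for recall, in line with the high-recall flagging use described in the abstract.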

 
