A Masked Segmental Language Model for Unsupervised Natural Language Segmentation

Abstract

Segmentation remains an important preprocessing step both in languages where "words" or other important syntactic/semantic units (like morphemes) are not clearly delineated by white space and in continuous speech data, where there is often no meaningful pause between words. Near-perfect supervised methods have been developed for resource-rich languages such as Chinese, but many of the world's languages are both morphologically complex and lack large datasets of "gold" segmentations into meaningful units. To solve this problem, we propose a new type of Segmental Language Model (Sun and Deng, 2018; Kawakami et al., 2019; Wang et al., 2021) for use in both unsupervised and lightly supervised segmentation tasks. We introduce a Masked Segmental Language Model (MSLM) built on a span-masking transformer architecture, harnessing the power of a bi-directional masked modeling context and attention. In a series of experiments, our model consistently outperforms Recurrent SLMs on Chinese (PKU Corpus) in segmentation quality, and performs similarly to the Recurrent model on English (PTB). We conclude by discussing the different challenges posed in segmenting phonemic-type writing systems.
