Comparing Neural- and N-Gram-Based Language Models for Word Segmentation

Abstract

Word segmentation is the task of inserting or deleting word boundary characters in order to separate character sequences that correspond to words in some language. In this article we propose an approach based on a beam search algorithm and a language model working at the byte/character level, the latter component implemented either as an n-gram model or a recurrent neural network. The resulting system analyzes input text that lacks word boundaries one token at a time, where a token is a character or a byte, and uses the information gathered by the language model to determine whether a boundary must be placed at the current position. Our aim is to use this system in a preprocessing step for a microtext normalization system. This means that it needs to cope effectively with the data sparsity present in this kind of text. We also strove to surpass the performance of two readily available word segmentation systems: the well-known and accessible Word Breaker by Microsoft, and the Python module WordSegment by Grant Jenks. The results show that we have met our objectives, and we hope to continue to improve both the precision and the efficiency of our system in the future.
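
As a rough illustration of the approach described in the abstract, the following Python sketch pairs a toy character-level n-gram model with a beam search over boundary decisions. The class `CharNGramLM`, the function `segment`, the add-k smoothing, the beam width, and the tiny training corpus are all assumptions made for this example; they are not the authors' implementation. A recurrent neural network could in principle be substituted for the n-gram model behind the same `logprob` interface.

```python
import math
from collections import defaultdict


class CharNGramLM:
    """Toy character n-gram model with add-k smoothing (illustrative only)."""

    def __init__(self, order=3, k=0.1):
        assert order >= 2, "this sketch needs at least one character of context"
        self.order = order
        self.k = k
        self.counts = defaultdict(lambda: defaultdict(int))
        self.vocab = set()

    def train(self, text):
        # Pad the start so the first characters also have a full context.
        padded = " " * (self.order - 1) + text
        for i in range(self.order - 1, len(padded)):
            context = padded[i - self.order + 1:i]
            self.counts[context][padded[i]] += 1
            self.vocab.add(padded[i])

    def logprob(self, history, ch):
        # Log-probability of `ch` given the last (order - 1) characters.
        context = (" " * (self.order - 1) + history)[-(self.order - 1):]
        seen = self.counts.get(context, {})
        total = sum(seen.values())
        vocab_size = len(self.vocab) + 1  # one extra slot for unseen symbols
        return math.log((seen.get(ch, 0) + self.k) / (total + self.k * vocab_size))


def segment(text, lm, beam_width=8):
    """Insert spaces into `text` using beam search guided by `lm`."""
    beams = [("", 0.0)]  # each hypothesis: (output so far, log-probability)
    for i, ch in enumerate(text):
        candidates = []
        for out, score in beams:
            # Hypothesis 1: no boundary before the current character.
            candidates.append((out + ch, score + lm.logprob(out, ch)))
            # Hypothesis 2: a boundary before it (never at the very start).
            if i > 0:
                spaced = out + " "
                new_score = score + lm.logprob(out, " ") + lm.logprob(spaced, ch)
                candidates.append((spaced + ch, new_score))
        # Keep only the best `beam_width` hypotheses for the next step.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams[0][0]


# Hypothetical toy corpus; a real system would train on far more text.
lm = CharNGramLM(order=3)
lm.train("the cat sat on the mat the dog sat on the log " * 20)

unsegmented = "thecatsatonthemat"
result = segment(unsegmented, lm)
print(result)
```

At each input position the search keeps two continuations per hypothesis, one with and one without a boundary, and prunes to the highest-scoring `beam_width` hypotheses. Whatever the model quality, the output differs from the input only by inserted spaces.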
