Abstract
Character-level models of tokens have been shown to be effective at dealing with within-token noise and out-of-vocabulary words. But these models still rely on correct token boundaries. In this paper, we propose a novel end-to-end character-level model and demonstrate its effectiveness in multilingual settings and when token boundaries are noisy. Our model is a semi-Markov conditional random field with neural networks for character and segment representation. It requires no tokenizer. The model matches state-of-the-art baselines for various languages and significantly outperforms them on a noisy English version of a part-of-speech tagging benchmark dataset.