How to Evaluate Word Representations of Informal Domain?

Abstract

Diverse word representations have proliferated in most state-of-the-art natural language processing (NLP) applications. Nevertheless, how to efficiently evaluate such word embeddings in the informal domain, such as Twitter or forums, remains an ongoing challenge due to the lack of sufficient evaluation datasets. We derived a large list of variant spelling pairs from UrbanDictionary using two automatic approaches: weakly supervised pattern-based bootstrapping and a self-training linear-chain conditional random field (CRF). With these extracted relation pairs, we make the case for skipping the text normalization step of traditional NLP pipelines and directly adopting representations of non-standard words in the informal domain. Our code is available.
