R-grams: Unsupervised Learning of Semantic Units in Natural Language

  • 2018-08-14 13:15:43
  • Ariel Ekgren, Amaru Cuba Gyllensten, Magnus Sahlgren
  • 9

Abstract

This paper introduces a novel type of data-driven segmented unit that we call r-grams. We illustrate one algorithm for calculating r-grams, and discuss its properties and impact on the frequency distribution of text representations. The proposed approach is evaluated by demonstrating its viability in embedding techniques, both in monolingual and multilingual test settings. We also provide a number of qualitative examples of the proposed methodology, demonstrating its viability as a language-invariant segmentation procedure.

 
