Investigating an Effective Character-level Embedding in Korean Sentence Classification

  • 2019-07-17 02:26:30
  • Won Ik Cho, Seok Min Kim, Nam Soo Kim

Abstract

Unlike the writing systems of many Romance and Germanic languages, some languages or language families show complex conjunct forms in character composition. For such cases, where the conjuncts consist of components representing consonant(s) and a vowel, various character encoding schemes can be adopted beyond merely making up a one-hot vector. However, little work has been done on intra-language comparison of the performance obtained with each representation. In this study, using the Korean language, which is character-rich and agglutinative, we investigate which encoding scheme is the most effective among Jamo-level one-hot, character-level one-hot, character-level dense, and character-level multi-hot. Classification performance with each scheme is evaluated on two corpora: one on binary sentiment analysis of movie reviews, and the other on multi-class identification of intention types. The results show that the character-level features perform better in general, although the Jamo-level features may be compatible with attention-based models given an adequately sized parameter set.
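To make the difference between the Jamo-level and character-level views concrete, the following is a minimal sketch, not the authors' implementation. It relies on the standard Unicode arithmetic for precomposed Hangul syllables (19 initial consonants, 21 vowels, 28 final slots including "no final"). The function names `decompose` and `multi_hot`, and the 67-dimensional layout in which an absent final consonant leaves the last segment all zeros, are assumptions made for illustration only.

```python
# Illustrative sketch: names and vector layout are assumptions, not from the paper.
HANGUL_BASE = 0xAC00  # code point of the first precomposed syllable
NUM_LEAD, NUM_VOWEL, NUM_TAIL = 19, 21, 28  # tail index 0 means "no final consonant"


def decompose(ch):
    """Return (lead, vowel, tail) Jamo indices for one precomposed Hangul syllable."""
    code = ord(ch) - HANGUL_BASE
    if not 0 <= code < NUM_LEAD * NUM_VOWEL * NUM_TAIL:
        raise ValueError(f"{ch!r} is not a precomposed Hangul syllable")
    lead, rest = divmod(code, NUM_VOWEL * NUM_TAIL)
    vowel, tail = divmod(rest, NUM_TAIL)
    return lead, vowel, tail


def multi_hot(ch):
    """Concatenate one-hot segments for lead, vowel, and tail into one vector.

    Assumed layout: 19 + 21 + 27 = 67 dimensions; a missing final
    consonant is encoded by leaving the tail segment all zeros.
    """
    lead, vowel, tail = decompose(ch)
    vec = [0] * (NUM_LEAD + NUM_VOWEL + NUM_TAIL - 1)
    vec[lead] = 1
    vec[NUM_LEAD + vowel] = 1
    if tail > 0:
        vec[NUM_LEAD + NUM_VOWEL + tail - 1] = 1
    return vec


if __name__ == "__main__":
    print(decompose("한"))
    print(sum(multi_hot("한")), len(multi_hot("한")))
```

Under these assumptions, a Jamo-level one-hot scheme would feed the three indices as separate sequence steps, whereas the multi-hot scheme keeps one vector per syllable with up to three active positions.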

 
