End-to-End Text Classification via Image-based Embedding using Character-level Networks

  • 2018-10-08 17:44:34
  • Shunsuke Kitada, Ryunosuke Kotani, Hitoshi Iyatomi

Abstract

For analysing and understanding languages that have no explicit word boundaries, such as Japanese, Chinese, and Thai, it is desirable to perform appropriate word segmentation based on morphological analysis before computing word embeddings, but segmentation is inherently difficult in these languages. In recent years, various language models based on deep learning have made remarkable progress, and some of these methodologies avoid this difficulty by using character-level features. However, when a model is fed character-level features of the above languages, it often overfits because of the large number of character types. In this paper, we propose CE-CLCNN, a character-level convolutional neural network with a character encoder, to tackle these problems. The proposed CE-CLCNN is an end-to-end learning model with an image-based character encoder, i.e. it handles each character in the target document as an image. Through various experiments, we confirmed that CE-CLCNN embeds visually and semantically similar characters close together and achieves state-of-the-art results on several open document classification tasks. We report the performance of CE-CLCNN on the Wikipedia title estimation task and analyse its internal behaviour.
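To make the data flow described in the abstract concrete, the following is a minimal, untrained sketch of the general idea only: each character becomes a small image, a convolutional character encoder maps that image to an embedding, and a character-level 1-D convolution over the embedding sequence produces class probabilities. It is not the authors' architecture. The glyph size, embedding size, layer shapes, random weights, and the helper names `fake_glyph`, `encode_char`, and `classify` are all assumptions made for illustration, and the pseudo-random bitmaps stand in for glyphs that a real system would rasterise from a font.

```python
import numpy as np

GLYPH = 8      # hypothetical glyph side length in pixels
EMBED = 4      # hypothetical embedding size (number of 2-D conv filters)
HIDDEN = 6     # hypothetical number of 1-D conv filters over the sequence
N_CLASSES = 3  # hypothetical number of document classes

def fake_glyph(ch):
    """Stand-in for a rendered character image (deterministic 0/1 bitmap)."""
    g = np.random.default_rng(ord(ch))
    return g.integers(0, 2, size=(GLYPH, GLYPH)).astype(float)

def encode_char(img, filters):
    """Character encoder: valid 2-D convolution, ReLU, global max pooling."""
    k = filters.shape[-1]
    out = GLYPH - k + 1
    emb = np.empty(len(filters))
    for f, w in enumerate(filters):
        resp = np.array([[np.sum(img[i:i + k, j:j + k] * w)
                          for j in range(out)] for i in range(out)])
        emb[f] = np.maximum(resp, 0.0).max()
    return emb

def classify(text, filters, seq_w, out_w):
    """Character-level classifier over the sequence of character embeddings."""
    # One embedding per character: shape (T, EMBED)
    seq = np.stack([encode_char(fake_glyph(c), filters) for c in text])
    width = seq_w.shape[1]
    # 1-D convolution over time: flatten each window of `width` characters
    windows = np.stack([seq[t:t + width].ravel()
                        for t in range(len(text) - width + 1)])
    # ReLU, then max-over-time pooling: shape (HIDDEN,)
    hidden = np.maximum(windows @ seq_w.reshape(HIDDEN, -1).T, 0.0).max(axis=0)
    logits = out_w @ hidden
    exp = np.exp(logits - logits.max())  # numerically stable softmax
    return seq, exp / exp.sum()

# Random (untrained) weights; end-to-end training would update all three jointly.
rng = np.random.default_rng(0)
filters = rng.normal(size=(EMBED, 3, 3))
seq_w = rng.normal(size=(HIDDEN, 3, EMBED))
out_w = rng.normal(size=(N_CLASSES, HIDDEN))

seq, probs = classify("日本語の文書", filters, seq_w, out_w)
print(seq.shape, probs.shape)
```

Because the encoder sees pixels rather than character IDs, no vocabulary-sized embedding table is required in this sketch; any character that can be rendered can be encoded, which is the property the abstract relies on for languages with many character types.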

 
