Word2Vec is a special case of Kernel Correspondence Analysis and Kernels for Natural Language Processing

  • 2018-11-25 17:49:08
  • Hirotaka Niitsuma, Minho Lee
We show that correspondence analysis (CA) is equivalent to defining a Gini index with appropriately scaled one-hot encoding. Using this relation, we introduce a nonlinear kernel extension to CA. This extended CA gives a known analysis for natural language via specialized kernels that use an appropriate contingency table. We propose a semi-supervised CA, which is a special case of the kernel extension to CA. Because CA requires excessive memory if applied to numerous categories, CA has not been used for natural language processing. We address this problem by introducing delayed evaluation to randomized singular value decomposition. The memory-efficient CA is then applied to a word-vector representation task. We propose a tail-cut kernel, which is an extension to the skip-gram within the kernel extension to CA. Our tail-cut kernel outperforms existing word-vector representation methods.
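For readers unfamiliar with CA, the following is a minimal sketch of the textbook formulation: a singular value decomposition of the standardized residuals of a contingency table. It is not the authors' kernel extension or their memory-efficient variant. The function name `correspondence_analysis` and the toy count table are assumptions made for illustration, and the dense matrix built here is exactly the memory cost the abstract says becomes prohibitive for large vocabularies.

```python
import numpy as np

def correspondence_analysis(table, n_components=2):
    """Classical CA of a contingency table (dense, illustrative only)."""
    N = np.asarray(table, dtype=float)
    P = N / N.sum()                      # correspondence matrix
    r = P.sum(axis=1)                    # row masses
    c = P.sum(axis=0)                    # column masses
    # Standardized residuals: (P - r c^T) scaled by 1 / sqrt(r_i * c_j).
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, sigma, Vt = np.linalg.svd(S, full_matrices=False)
    k = n_components
    # Principal coordinates of rows and columns.
    row_coords = (U[:, :k] * sigma[:k]) / np.sqrt(r)[:, None]
    col_coords = (Vt[:k].T * sigma[:k]) / np.sqrt(c)[:, None]
    return row_coords, col_coords, sigma

# Hypothetical 3x3 co-occurrence counts (rows: words, columns: contexts).
counts = np.array([[10, 2, 1],
                   [3, 8, 2],
                   [1, 3, 9]])
rows, cols, sigma = correspondence_analysis(counts)
```

The sum of the squared singular values (the total inertia) equals the chi-squared statistic of the table divided by the total count, which gives a quick sanity check on any CA implementation.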


