Abstract
We propose a novel framework for understanding text by converting sentences or articles into video-like 3-dimensional tensors. Each frame, corresponding to a slice of the tensor, is a word image rendered from the word's shape. The length of the tensor equals the number of words in the sentence or article. This transformation from text to a 3-dimensional tensor makes it convenient to implement an $n$-gram model with convolutional neural networks for text analysis. Concretely, we apply a 3-dimensional convolutional kernel to the 3-dimensional text tensor. The first two dimensions of the kernel equal the size of the word image, and the last dimension is $n$. That is, each time we slide the 3-dimensional kernel along the word sequence, the convolution covers $n$ word images and outputs a scalar. Repeating this process for each $n$-gram in the sentence or article with multiple kernels yields a 2-dimensional feature map. A 1-dimensional max-over-time pooling is then applied to this feature map, and three fully-connected layers perform the final text classification. Experiments on several text classification datasets demonstrate surprisingly strong performance of the proposed model in comparison with existing methods.
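
The convolution and pooling steps described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `ngram_conv_features` and `max_over_time`, the tensor layout (height, width, number of words), and the random arrays standing in for rendered word images and learned kernels are all assumptions made for illustration, and the three fully-connected layers are omitted.

```python
import numpy as np


def ngram_conv_features(text_tensor, kernels):
    """Slide each 3-D kernel along the word axis of a text tensor.

    text_tensor: array of shape (H, W, T), one H x W word image per word.
    kernels:     array of shape (K, H, W, n), K kernels spanning n word images.
    Returns a 2-D feature map of shape (T - n + 1, K).
    """
    H, W, T = text_tensor.shape
    K, kh, kw, n = kernels.shape
    # The first two kernel dimensions must match the word-image size,
    # so the kernel can only move along the word (time) axis.
    if (kh, kw) != (H, W):
        raise ValueError("kernel height/width must equal the word-image size")
    if T < n:
        raise ValueError("the text must contain at least n words")

    steps = T - n + 1
    feature_map = np.empty((steps, K))
    for t in range(steps):
        window = text_tensor[:, :, t:t + n]  # n consecutive word images
        # Each kernel produces one scalar per n-gram window.
        feature_map[t] = np.tensordot(kernels, window, axes=([1, 2, 3], [0, 1, 2]))
    return feature_map


def max_over_time(feature_map):
    """1-D max-over-time pooling: keep each kernel's largest response."""
    return feature_map.max(axis=0)


# Placeholder data (assumption): 5 words, 8 x 8 word images, 4 trigram kernels.
rng = np.random.default_rng(0)
text = rng.random((8, 8, 5))
kernels = rng.standard_normal((4, 8, 8, 3))

fmap = ngram_conv_features(text, kernels)  # one row per trigram, one column per kernel
pooled = max_over_time(fmap)               # one value per kernel
```

In this sketch the pooled vector has one entry per kernel regardless of text length, which is what would allow a fixed-size classifier to follow it.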