We propose a novel framework to understand text by converting sentences or articles into video-like 3-dimensional tensors. Each frame, corresponding to a slice of the tensor, is a word image rendered from the word's shape. The length of the tensor equals the number of words in the sentence or article. The proposed transformation from text to a 3-dimensional tensor makes it very convenient to implement an $n$-gram model with convolutional neural networks for text analysis. Concretely, we impose a 3-dimensional convolutional kernel on the 3-dimensional text tensor. The first two dimensions of the kernel size equal the size of the word image, and the last dimension is $n$. That is, each time we slide the 3-dimensional kernel over a word sequence, the convolution covers $n$ word images and outputs a scalar. By repeating this process for each $n$-gram along the sentence or article with multiple kernels, we obtain a 2-dimensional feature map. A subsequent 1-dimensional max-over-time pooling is applied to this feature map, and three fully connected layers are used for the final text classification. Experiments on several text classification datasets demonstrate surprisingly superior performance of the proposed model in comparison with existing methods.
Text classification with pixel embedding
Word representation is the foundation of natural language processing (NLP) tasks such as text classification [Kim, 2014], machine translation [Sutskever et al., 2014], and question answering [Zhou et al., 2015]. The most straightforward approach to word representation is one-hot encoding, which projects words into sparse 1-of-$V$ vectors, with $V$ being the size of the vocabulary. Another popular framework is to construct word vectors using word2vec [Mikolov et al., 2013; Pennington et al., 2014], which is an unsupervised approach. Both the 1-of-$V$ and word2vec encodings have their own limitations: the one-hot embedding suffers from the curse of dimensionality, and word2vec requires a prior corpus for pre-training. Both are word-level embeddings. In addition to word-level encoding, Zhang et al. [2015] propose a character-level convolutional neural network (char-CNN), which quantizes the characters of alphabetic scripts. However, character-level embedding is inapplicable to ideographic languages such as Chinese and Japanese, because the number of characters in such languages can be huge.
Intuitively, when we read an article on a screen, our eyes capture the text as a series of images, which are then passed on to the brain for recognition and understanding. Hence, a natural way of representing words is to use the visual shapes of the words or characters as features [Shimada et al., 2016; Sun et al., 2019; Su and Lee, 2017; Liu et al., 2017]. For example, Su and Lee [2017] and Shimada et al. [2016] treat Chinese and Japanese characters as images and apply a convolutional autoencoder that takes those images as inputs and outputs low-dimensional character embeddings. With such character embeddings, a char-CNN [Shimada et al., 2016] or traditional recurrent neural networks [Su and Lee, 2017; Liu et al., 2017] can be adapted for Chinese and Japanese text analysis tasks.
However, the limitations of these visual embedding models are obvious: (1) characters are treated separately with a traditional local convolutional kernel, which ignores the statistics of characters and the co-occurrence of words ($n$-gram characteristics); (2) these models compress the visual vector of a word or character into a low-dimensional vector, which makes them lack interpretability; and (3) existing visual embedding models are all designed for ideographic languages.
To solve these problems, we propose a novel framework that adapts a word's pixel (visual) embedding for English, and which can be easily extended to other languages. Concretely, we render the shape of each word in a document or sentence as an image and then fold those images sequentially into a 3-dimensional tensor. That is, the document or sentence is converted into a video-like 3-dimensional tensor. Each frame of the video corresponds to a word, and the length of the video equals the number of words in the document. To capture the $n$-gram characteristics of the text, we propose to impose 3-dimensional convolutions on the “text video”. Compared with a small convolutional kernel (traditionally, a $3 \times 3$ kernel is most frequently used), we use a big 3-dimensional kernel of size $w \times h \times n$ to extract statistical information from the text, where $w$ and $h$ are the width and height of the word images respectively, and $n$ is the number of words covered by the kernel. This can be interpreted as an $n$-gram model, as shown in Figure 1. With multiple 3-dimensional kernels, the convolutional layer outputs a feature matrix whose columns are features of $n$-grams and whose rows correspond to the channels of different kernels. Following [Kim, 2014], a subsequent 1-dimensional max-over-time pooling is applied to this 2-dimensional feature map. Finally, three fully connected (FC) layers are used for text classification.
The contributions of our work are threefold:
- We propose to represent a sentence or an article as a video-like 3-dimensional tensor, in which each frame represents one word of the sentence or article;
- We use a 3-dimensional convolutional kernel to learn $n$-gram features from the tensor representation of the text;
- We evaluate our model on several text classification tasks in terms of both performance and interpretability.
2 Related Works
Recently, deep learning has been shown to achieve impressive performance on NLP tasks [Kim, 2014; Wang et al., 2015; Iyyer et al., 2015; Goldberg, 2016; Jiang et al., 2018; Jacovi et al., 2018; Tang et al., 2019]. In NLP with deep neural networks, one typically needs to find a way to embed the raw text into features that computers can “recognize and understand”. Existing approaches to text embedding can be categorized into three frameworks, from coarse to fine. The first is the document-level or sentence-level approach, which embeds documents or sentences into vectors [Le and Mikolov, 2014; Lin et al., 2017]. The second is word-level embedding [Mikolov et al., 2013; Pennington et al., 2014; Joulin et al., 2017], and the last is character-level [Zhang et al., 2015] or radical-level embedding [Ke and Hagiwara, 2017].
The simplest implementation of word-level embedding is to encode words as one-hot vectors. The dimension of the one-hot vector equals the size of the vocabulary, $V$. Typically, $V$ ranges from thousands to tens of thousands, which may lead to the curse of dimensionality. Another approach is to construct a corpus-related matrix that contains statistical information about the corpus and then compute the word representation by factoring the matrix [Deerwester et al., 1990]. However, the constructed matrix is usually large, which makes the decomposition very time-consuming. The most classical way of word-level embedding is based on the word2vec framework [Mikolov et al., 2013], which originates from the neural language model. Word2vec uses a word's local context to encode its semantic features into a low-dimensional dense vector, also called a distributed word vector. However, the quality of the word vectors depends heavily on the quality and quantity of the corpus. As an improvement, Pennington et al. [2014] propose to combine global matrix factorization [Deerwester et al., 1990] with local context [Mikolov et al., 2013], which strikes a balance between performance and cost. For an NLP task, both the one-hot and distributed representation methods have their own limitations: the one-hot embedding suffers from the curse of dimensionality, and word2vec requires a corpus as well as pre-training prior to the specific NLP task.
Another popular framework is document-to-vector or sentence-to-vector embedding [Le and Mikolov, 2014; Lin et al., 2017], which aims to represent sentences, paragraphs, and documents as vectors. Le and Mikolov [2014] propose a “Paragraph Vector” model to learn fixed-length feature representations for sentences or documents in an unsupervised way. The “Paragraph Vector” is based on word2vec [Mikolov et al., 2013].
Zhang et al. [2015] propose a character-level encoding model that quantizes the characters in English words sequentially. Combining this elegant design of text embedding with convolutional neural networks (CNN), their method achieves excellent results on text classification. Unfortunately, the character-level encoding method is only applicable to phonographic scripts such as English and cannot be extended to logographic languages such as Chinese or Japanese. Following the char-CNN, Ke and Hagiwara [2017] propose to encode Chinese and Japanese characters with their semantic radical components to bridge this gap. In their model, each Chinese or Japanese character is divided into a sequence of radical-level embeddings. However, the radical-level method ignores the spatial structure of Chinese characters, which is a major difference between Chinese and alphabetic scripts. The pixel embedding proposed in this paper is completely different from the existing encoding methods. It is directly motivated by the way humans read: the eyes receive visual signals of the text and send them to the brain for further analysis. Therefore, we use the pixel image of the text as its representation, which mimics the way humans read text.
In contrast to one-hot, word2vec, and character-level embeddings, humans read and understand text from a completely different perspective, one based on the visual shapes of the words. Intuitively, when we read a web article on a screen or a book, our eyes capture the text as a series of images rather than embedding it into vectors. In other words, humans understand text through the visual information of the words, i.e., we recognize characters or words from the images captured by our eyes. Therefore, we believe that the pixel image, i.e., the character's morphological shape, provides a natural way to represent characters and words. Motivated by this idea, several visual embedding methods [Shimada et al., 2016; Su and Lee, 2017; Sun et al., 2019] have been developed for Chinese and Japanese text understanding. However, it is very difficult to visually embed alphabetic languages such as English, because English words, unlike Chinese or Japanese characters, cannot all be rendered as images of the same size.
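Table 1: Preprocessing steps, vector sparsity, and vector dimension for the compared word embedding schemes.
|Slangs and abbreviations|
|Remove stop words|
|Remove low-frequency words|
|Stem and lemmatization||–||–|
|Maintain a vocabulary|
|Sparsity of vector|
|Dimension of vector||70|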
The proposed model for text classification is shown in Figure 1. Given a document or a sentence $S$, we first render the $i$-th word in this document as a matrix $X_i \in \mathbb{R}^{w \times h}$, where $w$ and $h$ are the width and height of the word image. The series of word matrices is then folded sequentially into a 3-dimensional tensor $T \in \mathbb{R}^{w \times h \times l}$, where $l$ is the length of the sentence. In other words, the document or sentence is taken as a “video”, and each frame of the video corresponds to a word of $S$ with the size of $w \times h$. Rather than extracting a word's representation from the visual pixel map with a convolutional autoencoder [Shimada et al., 2016; Su and Lee, 2017], our model uses a 3-dimensional convolutional layer to process the “text video”. The size of the convolutional kernel is $w \times h \times n$, where $n$ is the number of words that the kernel covers at a time. Hence, the 3-dimensional kernel acts as an $n$-gram detector. The single convolution with multiple kernels produces a new feature map $F$, as shown in Figure 1. After the single convolutional layer, we apply max-over-time pooling [Collobert et al., 2011] for down-sampling. The max-over-time pooling in our model differs from the pooling operations popular in computer vision: we conduct a 1-dimensional max-pooling procedure along the time axis for each channel. The 2-dimensional feature map is reduced by the max-over-time pooling procedure, followed by a nonlinear activation function (e.g., the ReLU function). Finally, we flatten the feature map after the max pooling, and the FC layers accept the flattened vectors as inputs to make the final classification.
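The forward pass described above can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: the helper names (`ngram_conv`, `feature_map`, `max_over_time`), the tiny 2×2 "word images", and the hand-picked kernel weights are all hypothetical, and the activation and FC layers are omitted.

```python
from typing import List

Image = List[List[float]]  # one word image: a list of pixel rows


def ngram_conv(frames: List[Image], kernel: List[Image], bias: float = 0.0) -> List[float]:
    """Slide a full-frame kernel covering n consecutive word images over the text.

    Each position yields one scalar, so l frames give l - n + 1 values
    (stride 1, no padding).
    """
    n = len(kernel)
    outputs = []
    for start in range(len(frames) - n + 1):
        total = bias
        for offset in range(n):
            frame, weights = frames[start + offset], kernel[offset]
            for frame_row, weight_row in zip(frame, weights):
                total += sum(p * w for p, w in zip(frame_row, weight_row))
        outputs.append(total)
    return outputs


def feature_map(frames: List[Image], kernels: List[List[Image]]) -> List[List[float]]:
    """One row per kernel (channel), one column per n-gram position."""
    return [ngram_conv(frames, kernel) for kernel in kernels]


def max_over_time(fmap: List[List[float]], size: int, stride: int) -> List[List[float]]:
    """1-D max pooling along the n-gram (time) axis, applied to each channel."""
    return [
        [max(row[i:i + size]) for i in range(0, len(row) - size + 1, stride)]
        for row in fmap
    ]


# Toy "text video": four words, each rendered as a 2x2 image (made-up pixels).
frames = [
    [[1, 0], [0, 0]],
    [[0, 1], [0, 0]],
    [[0, 0], [1, 0]],
    [[0, 0], [0, 1]],
]
# Two hypothetical 2-gram kernels, each spanning two whole frames.
kernels = [
    [[[1, 1], [1, 1]], [[1, 1], [1, 1]]],
    [[[1, 0], [0, 0]], [[0, 1], [0, 0]]],
]
fmap = feature_map(frames, kernels)             # 2 channels x 3 two-gram positions
pooled = max_over_time(fmap, size=2, stride=1)  # 2 channels x 2 pooled positions
```

Because each kernel spans an entire frame, the two spatial dimensions collapse to a single value per position, which is why the output is a 2-dimensional map rather than a volume.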
Table 1 compares the word's visual representation with other existing word embedding schemes in terms of data preprocessing steps. From this summary, it is clear that the char-CNN and our method require far fewer preprocessing steps than the one-hot vector and the distributed representation. The last two rows in Table 1 summarize the sparsity and dimension of the word vectors for each method.
3.2 Network Implementation
The network architecture can be described as follows:
- Conv3d layer: kernel size = (20, 131, 3), stride = (1, 1, 1), number of kernels = 50, padding = 0;
- MaxPool1d layer (the max-over-time pooling): kernel size = 3, stride = 3, dilation = 3, padding = 0;
- FC layer 1: input = 1250, output = 512;
- FC layer 2: input = 512, output = 100;
- FC layer 3: input = 100, output = number of classes.
The specification stride = 3 for the MaxPool1d results in no overlaps in max-over-time pooling.
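As a rough consistency check on these layer sizes, the following sketch computes the FC input size from the number of words. It assumes that the layer names follow PyTorch's conventions and therefore uses PyTorch's documented output-length formulas for convolution and max pooling; the function names are hypothetical, and the padded sentence length is not stated in the text, so the lengths found below are derived values rather than reported settings.

```python
def conv_out_len(length: int, kernel: int, stride: int = 1, padding: int = 0) -> int:
    """Output length of a convolution along one axis (dilation 1)."""
    return (length + 2 * padding - kernel) // stride + 1


def pool_out_len(length: int, kernel: int, stride: int,
                 dilation: int = 1, padding: int = 0) -> int:
    """Output length of 1-D max pooling, following PyTorch's MaxPool1d formula."""
    return (length + 2 * padding - dilation * (kernel - 1) - 1) // stride + 1


def fc_input_size(num_words: int, n: int = 3, num_kernels: int = 50,
                  dilation: int = 3) -> int:
    """Flattened size fed to FC layer 1 for a text of num_words word images."""
    ngrams = conv_out_len(num_words, n)  # l - n + 1 positions along the text
    pooled = pool_out_len(ngrams, kernel=3, stride=3, dilation=dilation)
    return num_kernels * pooled


# A 20 x 131 kernel on a 20 x 131 word image leaves a single spatial position.
spatial = (conv_out_len(20, 20), conv_out_len(131, 131))

# Text lengths (in words) that would give the stated FC input of 1250.
matching_lengths = [l for l in range(9, 200) if fc_input_size(l) == 1250]
```

Under these assumptions, texts padded or truncated to 81 to 83 words reproduce the FC input of 1250. Note that with dilation = 3 each pooling window samples positions i, i+3, and i+6, so consecutive windows share positions under this formula; passing `dilation=1` gives strictly non-overlapping windows and matches 1250 for 77 to 79 words instead.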
3.3 Model Interpretation
The proposed model has a concise structure with one convolutional layer, one max-pooling layer, and three subsequent FC layers. In image processing, a convolution between a kernel and the local pixels of an image (usually covering a small area such as $3 \times 3$) can blur, sharpen, or emboss the image, or detect its edges. This process is applied repeatedly across neighbouring local areas.
Unlike the traditional approach, in which the kernel focuses on local pixels of an image, we propose to compute the weighted average between the convolutional kernel and the whole word image, as shown in Figure 1. This means that the convolution kernel should have the same size as the word images. Because of the video-like representation of the text data, when the 3-dimensional kernel slides over the text tensor, it computes the convolutional weighted average for several word images at a time. We prefer this kind of global convolution to local convolution for the following reasons: first, the information in text images is centralized; second, this design makes it very convenient to interpret the convolutional operation as an $n$-gram detector.
As shown in Figure 1, each time the convolutional kernel slides over the text, it operates on two neighbouring word images, and in this case it is a 2-gram detector. During training, we input a sentence or an article that contains $l$ words, and the first layer of the proposed model outputs $n$-gram feature vectors sequentially. Some of the high-frequency $n$-grams of the corpus are repeatedly detected by the 3-dimensional kernel. Therefore, the components of the feature vector corresponding to those high-frequency $n$-grams are larger than the others, whereas the components corresponding to low-frequency word pairs are small. By applying different kernels, we obtain a feature map $F$ as the output of the first layer, as shown in Figure 1. The columns of $F$ are the $n$-gram features and the rows correspond to the channels.
At test time, when a test sentence is input, a corresponding feature map $F$ is produced by the first layer of the trained model. As stated earlier, a larger value of $F_{ij}$ indicates that the $j$-th 2-gram of this test sentence is more frequently detected by the $i$-th filter, where $i$ is the index of the kernel and $j$ is the index of the 2-gram phrase.
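| Dataset | Training | Validation | Test | Classes | Ave length | Content |
|---|---|---|---|---|---|---|
| DBPedia | 448,000 | 112,000 | 70,000 | 14 | 52 | Title + abstract of article |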
“Classes” represents the number of classes, and “Ave length” refers to the average number of words in the content.
For comparisons, we consider four baseline methods as follows:
- The character-level convolutional neural network (char-CNN) [Zhang et al., 2015];
- CNN for text classification on top of one-hot word vectors, denoted as CNN one-hot;
- CNN for text classification on top of distributed word vectors obtained via word2vec [Kim, 2014], denoted as CNN word2vec;
- FastText [Joulin et al., 2017].
We experiment with two variants of the proposed model:
- Our model with the max-over-time pooling, as shown in Figure 1;
- Our model with the 1-dimensional max-over-time pooling replaced by a 2-dimensional max pooling. The kernel size is .
The five datasets used in our experiments are described as follows:
- The AG's news corpus is a collection of more than 1 million news articles (https://www.di.unipi.it/~gulli/AG_corpus_of_news_articles.html). In [Zhang et al., 2015], the four largest classes, namely World, Sports, Business, and Science/Technology, are selected from this corpus. Each sample is constructed by joining the title and description fields.
- The DBPedia ontology dataset (https://wiki.dbpedia.org/services-resources/dbpedia-data-set-2014). The DBpedia dataset uses a large multi-domain ontology derived from Wikipedia [Lehmann et al., 2015]. The DBpedia ontology dataset is constructed by picking 14 non-overlapping classes: Company, Educational Institution, Artist, Athlete, Office Holder, Mean Of Transportation, Building, Natural Place, Village, Animal, Plant, Album, Film, and Written Work. For each of these 14 ontology classes, each sample is the concatenation of the title and abstract of a Wikipedia article.
- Yelp reviews. The Yelp reviews dataset is obtained from the Yelp Dataset Challenge in 2015. Each review in this dataset has a user review score ranging from 1 star to 5 stars. Predicting the number of stars corresponds to a 5-class classification task.
- Yahoo Answers corpus. This corpus is extracted from the Yahoo! Answers Comprehensive Questions and Answers version 1.0. We follow [Zhang et al., 2015] to construct a topic classification dataset from this corpus by selecting 10 main categories: Society & Culture, Science & Mathematics, Health, Education & Reference, Computers & Internet, Sports, Business & Finance, Entertainment & Music, Family & Relationships, and Politics & Government.
- The Amazon reviews dataset. We obtain the Amazon review dataset from the Stanford Network Analysis Project (SNAP); it spans 18 years, with 34,686,770 reviews from 6,643,669 users on 2,441,053 products [McAuley and Leskovec, 2013]. Unlike for the Yelp review dataset, we predict a binary sentiment label for each review in the Amazon dataset. Reviews with 1 or 2 stars are labelled as negative, and those with 3 or 4 stars are labelled as positive.
The training, validation, and testing splits for all five datasets are shown in Table 2.
4.3 Training setting
To implement our model, we need to render words into word images according to their shapes. Unfortunately, the lengths of English words vary dramatically and can sometimes be very large, as shown in Figure 2. In particular, some words in web text can have over 50 characters. For word image rendering, we set the image size according to the longest word. If the maximum length is too large, it increases the blank space in the word images, which causes redundancy for short words.
For English, most words are shorter than 17 characters, as shown in Figure 2. To balance performance and redundancy, words longer than 17 characters are removed from the corpus. That is, we set the maximum word length in our corpus to 17. With this threshold, we render each word in our corpus into a word image. Here, 131 pixels is the minimum width that can hold 17 English characters at a font size of 20. The font we used for English characters is “New Times”.
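The length threshold can be illustrated with a few lines of Python. This is only a sketch: whitespace tokenization and the names `MAX_WORD_LEN` and `filter_words` are assumptions made here, since the text does not specify how words are tokenized.

```python
MAX_WORD_LEN = 17  # maximum word length kept in the corpus, as stated above


def filter_words(text: str, max_len: int = MAX_WORD_LEN) -> list:
    """Split on whitespace (an assumption) and drop words longer than max_len."""
    return [word for word in text.split() if len(word) <= max_len]


kept = filter_words("the internationalization of text")
```

Each surviving word would then be rendered onto a fixed-size canvas, so that every frame of the text tensor has the same shape.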
We use the Adam [Kingma and Ba, 2014] algorithm as the network optimizer with the learning rate equal to 0.0001. The dropout rate of the FC network is 0.5.
4.4 Prediction for Topics and Sentiments
The first experiment concerns the overall performance of text classification on both document topics and sentiments. Table 3 shows the testing accuracy of all models on the five listed datasets. The first four datasets have multiple categories, and the last dataset, “Amazon”, has binary labels for review polarity.
Clearly, the proposed methods achieve superior performance on text classification compared with the existing ones for all five datasets. It is worth emphasizing that both of our models (with or without max-over-time pooling) accept the text images as inputs without any of the preprocessing steps required by other approaches, such as correcting misspellings, removing low-frequency words and stop words, stemming and lemmatization, or maintaining a vocabulary of words or characters. Our methods also do not need to pre-train word vectors, as the word2vec-based methods do. The results of the two variants of our model demonstrate that max-over-time pooling is an efficient and necessary operation for text feature extraction. Furthermore, compared with the existing methods, the proposed model is much more interpretable, as detailed in the next section.
|New w/o max||0.83||0.93||0.75||0.54||0.91|
4.5 Visualization of the $n$-gram detectors
The 3-dimensional convolutional kernel acts as an $n$-gram detector in our model. As shown in Figure 1, the convolutional kernel operates on two frames (i.e., two words) at a time, which corresponds to a 2-gram detector. For a sentence $S$ of length $l$, we can generate a feature vector $f \in \mathbb{R}^{l-n+1}$, which is a continuous $n$-gram feature of $S$. By applying different kernels, we obtain a feature map $F$ after the 3-dimensional convolution.
At test time, we input a test sentence and obtain a corresponding feature map $F$ as the output. A larger value of $F_{ij}$ indicates that the $j$-th 2-gram of the input sentence is more significant for the classification. By locating the maximum element of $F$, we can trace back to the most heavily weighted word pair (the $j$-th word and the $(j+1)$-th word) in this sentence, where $1 \le j \le l-1$ and $l$ is the number of words in the sentence.
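Tracing the largest feature-map entry back to a word pair can be sketched as follows. The function name `top_bigram` and the feature values are hypothetical, chosen only to illustrate the lookup, and indices are 0-based here whereas the text counts from 1.

```python
def top_bigram(words, fmap):
    """Return (kernel index, 2-gram position, word pair) for the largest entry of fmap.

    fmap has one row per kernel and one column per 2-gram, so column j
    covers words[j] and words[j + 1].
    """
    best_i, best_j = max(
        ((i, j) for i, row in enumerate(fmap) for j in range(len(row))),
        key=lambda ij: fmap[ij[0]][ij[1]],
    )
    return best_i, best_j, (words[best_j], words[best_j + 1])


words = ["Oil", "prices", "rose", "again"]
# Made-up feature map: 2 kernels x 3 two-gram positions.
fmap = [
    [0.2, 0.1, 0.0],
    [1.5, 0.3, 0.4],
]
result = top_bigram(words, fmap)
```

Here the second kernel responds most strongly at the first position, so the pair ("Oil", "prices") is reported as the most heavily weighted 2-gram.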
We visualize the weighted $n$-grams according to the first layer of the network trained to classify the AG's news dataset. It has four classes: “World”, “Sports”, “Business”, and “Science & Technology”. There are 7600 test samples in total, and each class has 1900 samples. Because all the test 2-grams have been weighted by the feature map for each class, we can visualize them with separate two-word phrase (2-gram) clouds, as shown in Figure 4 (a)–(d). A larger font size in the word cloud indicates that the two-word phrase has been detected more frequently by the 3-dimensional $n$-gram detectors. The two words of each 2-gram are joined by an underscore “_” for the convenience of visualization.
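One way to prepare the per-class counts behind such word clouds is sketched below. The text does not specify how detections are aggregated, so this assumes one top-weighted 2-gram per test sample; the function name `bigram_cloud_counts` and the sample data are hypothetical.

```python
from collections import Counter, defaultdict


def bigram_cloud_counts(samples):
    """Count detected 2-grams per class, joining each word pair with an underscore.

    samples: iterable of (label, (word_a, word_b)) pairs.
    """
    counts = defaultdict(Counter)
    for label, (word_a, word_b) in samples:
        counts[label][f"{word_a}_{word_b}"] += 1
    return counts


# Hypothetical detections, using phrases mentioned in the text.
samples = [
    ("World", ("in", "Iraq")),
    ("World", ("in", "Iraq")),
    ("World", ("President", "Bush")),
    ("Sports", ("World", "Cup")),
]
counts = bigram_cloud_counts(samples)
```

The resulting frequencies per class could then be passed to any word-cloud renderer, with font size proportional to the count.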
According to Figure 4 (a), the 2-gram “in Iraq” is the most frequently detected phrase for the category “World”. Other phrases associated with “World” news, such as “Canadian Press”, “NEW YORK”, “President Bush”, and “UNITED NATIONS”, have also been highlighted by our 2-gram detectors. In Figure 4 (b)–(d), we can also see that the 2-grams “World Cup”, “the Olympic”, and “Formula One” have been detected for the category “Sports”; “the company”, “target= stocks”, “Oil prices”, and so on for the category “Business”; and “the company”, “Apple Company”, and so on for the category “Science & Technology”.
Comparing Figure 4 (a) with Figure 4 (d), we observe that “NEW YORK” appears among the most heavily weighted 2-grams of both “World” and “Science & Technology”. This suggests that the phrase “NEW YORK” may be ambiguous for our 2-gram detector when distinguishing “World” news from “Science & Technology” news. The same situation arises between “Business” and “Science & Technology”: comparing Figure 4 (c) with Figure 4 (d), we find that the most highlighted 2-grams for both categories include “the company”, which clearly belongs to both. In contrast, the highlighted 2-grams in Figure 4 (b) have no intersection with those of the other three categories, which makes “Sports” the most distinctive category. The confusion matrix of the four-class classification in Figure 3 supports this argument.
We propose a novel framework to understand text data by converting English sentences or articles into video-like 3-dimensional tensors, which can be viewed as “video text”. Each frame, or slice, of the tensor is a word image rendered from the word's shape. This transformation makes it convenient to implement an $n$-gram model based on convolutional neural networks. We achieve this by imposing a 3-dimensional convolutional kernel on the text tensors. The first two dimensions of the kernel size are the same as the size of the word image, and the last dimension is $n$. That is, the 3-dimensional kernel covers $n$ words and outputs a scalar each time. A subsequent 1-dimensional max-over-time pooling is applied to the resulting feature map, and three FC layers then perform the final text classification. Experiments on both topic classification and sentiment analysis show surprisingly excellent results for the proposed model. Our model can be easily applied to other languages as well as to other NLP tasks such as machine translation.
- Collobert et al. [2011] Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel P. Kuksa. Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12:2493–2537, 2011.
- Deerwester et al. [1990] Scott Deerwester, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, and Richard Harshman. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391–407, 1990.
- Goldberg [2016] Yoav Goldberg. A primer on neural network models for natural language processing. Journal of Artificial Intelligence Research, 57:345–420, 2016.
- Iyyer et al. [2015] Mohit Iyyer, Varun Manjunatha, Jordan Boyd-Graber, and Hal Daumé III. Deep unordered composition rivals syntactic methods for text classification. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1681–1691, 2015.
- Jacovi et al. [2018] Alon Jacovi, Oren Sar Shalom, and Yoav Goldberg. Understanding convolutional neural networks for text classification. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 56–65, 2018.
- Jiang et al. [2018] Mingyang Jiang, Yanchun Liang, Xiaoyue Feng, Xiaojing Fan, Zhili Pei, Yu Xue, and Renchu Guan. Text classification based on deep belief network and softmax regression. Neural Computing and Applications, 29(1):61–70, 2018.
- Joulin et al. [2017] Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. Bag of tricks for efficient text classification. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 427–431, 2017.
- Ke and Hagiwara [2017] Yuanzhi Ke and Masafumi Hagiwara. Radical-level ideograph encoder for RNN-based sentiment analysis of Chinese and Japanese. In Asian Conference on Machine Learning, pages 561–573, 2017.
- Kim [2014] Yoon Kim. Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1746–1751, 2014.
- Kingma and Ba [2014] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- Le and Mikolov [2014] Quoc Le and Tomas Mikolov. Distributed representations of sentences and documents. In International Conference on Machine Learning, pages 1188–1196, 2014.
- Lehmann et al. [2015] Jens Lehmann, Robert Isele, Max Jakob, Anja Jentzsch, Dimitris Kontokostas, Pablo N. Mendes, Sebastian Hellmann, Mohamed Morsey, Patrick van Kleef, Sören Auer, and Christian Bizer. DBpedia – a large-scale, multilingual knowledge base extracted from Wikipedia. Semantic Web, 6(2):167–195, 2015.
- Lin et al. [2017] Zhouhan Lin, Minwei Feng, Cícero Nogueira dos Santos, Mo Yu, Bing Xiang, Bowen Zhou, and Yoshua Bengio. A structured self-attentive sentence embedding. In ICLR 2017: International Conference on Learning Representations, 2017.
- Liu et al. [2017] Frederick Liu, Han Lu, Chieh Lo, and Graham Neubig. Learning character-level compositionality with visual features. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2059–2068, 2017.
- McAuley and Leskovec [2013] Julian McAuley and Jure Leskovec. Hidden factors and hidden topics: understanding rating dimensions with review text. In Proceedings of the 7th ACM Conference on Recommender Systems, pages 165–172. ACM, 2013.
- Mikolov et al. [2013] Tomas Mikolov, Ilya Sutskever, Kai Chen, Gregory S. Corrado, and Jeffrey Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems 26, pages 3111–3119, 2013.
- Pennington et al. [2014] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, 2014.
- Shimada et al. [2016] Daiki Shimada, Ryunosuke Kotani, and Hitoshi Iyatomi. Document classification through image-based character embedding and wildcard training. In 2016 IEEE International Conference on Big Data (Big Data), pages 3922–3927. IEEE, 2016.
- Su and Lee [2017] Tzuray Su and Hungyi Lee. Learning Chinese word representations from glyphs of characters. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 264–273, 2017.
- Sun et al. [2019] Chi Sun, Xipeng Qiu, and Xuanjing Huang. VCWE: Visual character-enhanced word embeddings. In NAACL-HLT 2019: Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 2710–2719, 2019.
- Sutskever et al. [2014] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112, 2014.
- Tang et al. [2019] Buzhou Tang, Xiaolong Wang, Jun Yan, and Qingcai Chen. Entity recognition in Chinese clinical text using attention-based CNN-LSTM-CRF. BMC Medical Informatics and Decision Making, 19(3):74, 2019.
- Wang et al. [2015] Peng Wang, Jiaming Xu, Bo Xu, Chenglin Liu, Heng Zhang, Fangyuan Wang, and Hongwei Hao. Semantic clustering and convolutional neural network for short text categorization. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 352–357, 2015.
- Zhang et al. [2015] Xiang Zhang, Junbo Zhao, and Yann LeCun. Character-level convolutional networks for text classification. In Advances in Neural Information Processing Systems, pages 649–657, 2015.
- Zhou et al. [2015] Bolei Zhou, Yuandong Tian, Sainbayar Sukhbaatar, Arthur Szlam, and Rob Fergus. Simple baseline for visual question answering. arXiv preprint arXiv:1512.02167, 2015.