The Power of Communities: A Text Classification Model with Automated Labeling Process Using Network Community Detection

Abstract

Text classification is one of the most critical areas in machine learning and artificial intelligence research. It has been actively adopted in many business applications, such as conversational intelligence systems, news article categorization, sentiment analysis, emotion detection systems, and many other recommendation systems in our daily life. One of the problems in supervised text classification models is that the models' performance depends heavily on the quality of data labeling, which is typically done by humans. In this study, we propose a new network community detection-based approach to automatically label and classify text data into multiclass value spaces. Specifically, we build networks with sentences as the network nodes and pairwise cosine similarities between the Term Frequency-Inverse Document Frequency (TF-IDF) vector representations of the sentences as the network link weights. We use the Louvain method to detect the communities in the sentence networks. We train and test the Support Vector Machine and the Random Forest models on both the human-labeled data and the network community detection-labeled data. Results showed that models trained on the data labeled by network community detection outperformed the models trained on the human-labeled data by 2.68-3.75% in classification accuracy. Our method may help the development of more accurate conversational intelligence and other text classification systems.
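
The network construction step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the whitespace tokenizer, the smoothed IDF formula, the helper names (`tfidf_vectors`, `cosine`, `sentence_network`), and the four toy sentences are all assumptions made for illustration. It builds one TF-IDF vector per sentence and emits a weighted edge for every sentence pair with nonzero cosine similarity.

```python
import math
from collections import Counter
from itertools import combinations


def tfidf_vectors(sentences):
    """Return one sparse TF-IDF vector (dict: term -> weight) per sentence."""
    # Assumption: lowercase whitespace tokenization; the paper may preprocess differently.
    tokenized = [s.lower().split() for s in sentences]
    n = len(tokenized)
    # Document frequency: number of sentences that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Assumption: smoothed IDF, log((1 + n) / (1 + df)) + 1.
        vectors.append(
            {t: c * (math.log((1 + n) / (1 + df[t])) + 1) for t, c in tf.items()}
        )
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def sentence_network(sentences):
    """Return weighted edges (i, j, similarity); nodes are sentence indices."""
    vecs = tfidf_vectors(sentences)
    edges = []
    for i, j in combinations(range(len(vecs)), 2):
        w = cosine(vecs[i], vecs[j])
        if w > 0:  # pairs sharing no terms get no link
            edges.append((i, j, w))
    return edges


# Hypothetical toy data: two travel requests and two music requests.
sentences = [
    "book a flight to paris",
    "book a flight to rome",
    "play some jazz music",
    "play some rock music",
]
edges = sentence_network(sentences)
for i, j, w in edges:
    print(i, j, round(w, 3))
```

An edge list of this form could then be handed to a Louvain implementation, for example by loading it into a NetworkX graph with `add_weighted_edges_from` and calling `networkx.community.louvain_communities(G, weight="weight")`. The detected community memberships would serve as the automatically generated class labels for training the classifiers.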
