Linguistic data mining with complex networks: a stylometric-oriented approach

Abstract

By representing a text as a set of words and their co-occurrences, one obtains a word-adjacency network, which is a reduced representation of a given language sample. In this paper, the possibility of using the network representation to extract information about the individual language styles of literary texts is studied. By determining selected quantitative characteristics of the networks and applying machine learning algorithms, it is possible to distinguish between texts of different authors. Within the studied set of texts, English and Polish, properly rescaled weighted clustering coefficients and weighted degrees of only a few nodes in the word-adjacency networks are sufficient to obtain an authorship attribution accuracy of over 90%. A correspondence between the text authorship and the word-adjacency network structure can therefore be found. The network representation makes it possible to distinguish individual language styles by comparing the way the authors use particular words and punctuation marks. The presented approach can be viewed as a generalization of the authorship attribution methods based on simple lexical features. Additionally, other network parameters are studied, both local and global, for both the unweighted and weighted networks. Their potential to capture the writing style diversity is discussed; some differences between languages are observed.
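
The construction described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the tokenization rule, the function names, and the choice of the Barrat et al. definition of the weighted clustering coefficient are all assumptions made for illustration, and the rescaling step mentioned in the abstract is not reproduced because the abstract does not specify it. Words and punctuation marks are nodes, and an edge weight counts how often two tokens appear next to each other.

```python
import re
from collections import defaultdict
from itertools import combinations


def tokenize(text):
    # Assumption: words and punctuation marks are both treated as tokens (nodes).
    return re.findall(r"\w+|[^\w\s]", text.lower())


def word_adjacency_network(tokens):
    # adj[u][v] = number of times u and v occur next to each other (undirected).
    adj = defaultdict(lambda: defaultdict(int))
    for u, v in zip(tokens, tokens[1:]):
        if u != v:  # assumption: self-loops are ignored
            adj[u][v] += 1
            adj[v][u] += 1
    return adj


def weighted_degree(adj, node):
    # Sum of the weights of all edges attached to the node (node strength).
    return sum(adj[node].values())


def weighted_clustering(adj, node):
    # Barrat-style weighted clustering coefficient (assumed definition):
    # sum of (w_ij + w_ih) over connected neighbour pairs (j, h),
    # divided by s_i * (k_i - 1).
    neighbours = list(adj[node])
    k = len(neighbours)
    if k < 2:
        return 0.0
    s = weighted_degree(adj, node)
    total = sum(
        adj[node][j] + adj[node][h]
        for j, h in combinations(neighbours, 2)
        if h in adj[j]
    )
    return total / (s * (k - 1))


def node_features(text, nodes):
    # Feature vector: (weighted degree, weighted clustering) for each chosen node.
    adj = word_adjacency_network(tokenize(text))
    features = []
    for node in nodes:
        features.append(weighted_degree(adj, node))
        features.append(weighted_clustering(adj, node))
    return features
```

A vector such as `node_features(text, ["the", ",", "and"])`, computed for each text sample, could then be passed to any standard classifier; the particular nodes listed here are hypothetical and are not taken from the paper.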
