Oldies but Goldies: The Potential of Character N-grams for Romanian Texts

Abstract

This study addresses the problem of authorship attribution for Romanian texts using the ROST corpus, a standard benchmark in the field. We systematically evaluate six machine learning techniques: Support Vector Machine (SVM), Logistic Regression (LR), k-Nearest Neighbors (k-NN), Decision Trees (DT), Random Forests (RF), and Artificial Neural Networks (ANN), employing character n-gram features for classification. Among these, the ANN model achieved the highest performance, including perfect classification in four out of fifteen runs when using 5-gram features. These results demonstrate that lightweight, interpretable character n-gram approaches can deliver state-of-the-art accuracy for Romanian authorship attribution, rivaling more complex methods. Our findings highlight the potential of simple stylometric features in resource-constrained or under-studied language settings.
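
To make the feature type concrete, the following is a minimal sketch of character n-gram attribution, not the pipeline used in the study. It assumes raw n-gram counts, cosine similarity, and a 1-nearest-neighbour rule (the simplest case of k-NN); the function names `char_ngrams`, `cosine`, and `predict_author`, as well as the toy texts and author labels, are hypothetical and are not taken from the ROST corpus.

```python
from collections import Counter
from math import sqrt


def char_ngrams(text: str, n: int = 5) -> Counter:
    """Count overlapping character n-grams in a text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def predict_author(text: str, training: list, n: int = 5) -> str:
    """1-nearest-neighbour: return the label of the most similar training text."""
    query = char_ngrams(text, n)
    best = max(training, key=lambda pair: cosine(query, char_ngrams(pair[0], n)))
    return best[1]


# Toy, made-up training texts (assumption: illustration only, not ROST data).
training = [
    ("the quick brown fox jumps over the lazy dog", "author_a"),
    ("lorem ipsum dolor sit amet consectetur", "author_b"),
]

predicted = predict_author("the lazy dog jumps", training)
```

In practice the same n-gram vectors could be passed to any of the six classifiers named above; the nearest-neighbour rule is used here only because it needs no external library.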
