Hoaxpedia: A Unified Wikipedia Hoax Articles Dataset

Abstract

Hoaxes are a recognised form of deliberately created disinformation, with potentially serious implications for the credibility of reference knowledge resources such as Wikipedia. What makes detecting Wikipedia hoaxes hard is that they are often written according to the official style guidelines. In this work, we first provide a systematic analysis of similarities and discrepancies between legitimate and hoax Wikipedia articles, and introduce Hoaxpedia, a collection of 311 hoax articles (from existing literature and official Wikipedia lists), together with semantically similar legitimate articles, which together form a binary text classification dataset aimed at fostering research in automated hoax detection. We report results across several language models, hoax-to-legit ratios, and amounts of text the classifiers are exposed to (the full article vs. the article's definition alone). Our results suggest that detecting deceitful content in Wikipedia based on content alone is hard but feasible. We complement our analysis with a study of the differences in the distributions of edit histories, and find that looking at this feature yields better classification results than context.
