DarkBERT: A Language Model for the Dark Side of the Internet

Abstract

Recent research has suggested that there are clear differences in the language used in the Dark Web compared to that of the Surface Web. As studies on the Dark Web commonly require textual analysis of the domain, language models specific to the Dark Web may provide valuable insights to researchers. In this work, we introduce DarkBERT, a language model pretrained on Dark Web data. We describe the steps taken to filter and compile the text data used to train DarkBERT to combat the extreme lexical and structural diversity of the Dark Web that may be detrimental to building a proper representation of the domain. We evaluate DarkBERT and its vanilla counterpart along with other widely used language models to validate the benefits that a Dark Web domain-specific model offers in various use cases. Our evaluations show that DarkBERT outperforms current language models and may serve as a valuable resource for future research on the Dark Web.
