Building an Expert Annotated Corpus of Brazilian Instagram Comments for Hate Speech and Offensive Language Detection

  • 2021-05-02 20:58:41
  • Francielle Alves Vargas, Isabelle Carvalho, Fabiana Rodrigues de Góes, Fabrício Benevenuto, Thiago Alexandre Salgueiro Pardo


The understanding of an offense is subjective, and people may have different opinions about the offensiveness of a comment. Offenses and hate speech may also occur through sarcasm, which hides the real intention of the comment and makes the annotators' decisions harder. Therefore, providing a well-structured annotation process is crucial both to a better understanding of hate speech and offensive language phenomena and to better performance of machine learning classifiers. In this paper, we describe a corpus annotation process, guided by a linguist and a hate speech specialist, that supports the identification of hate speech and offensive language on social media. In addition, we provide the first robust corpus of this kind for the Brazilian Portuguese language. The corpus was collected from Instagram posts of political personalities and manually annotated. It comprises 7,000 documents annotated according to three different layers: a binary classification (offensive versus non-offensive language), the level of offense (highly offensive, moderately offensive, and slightly offensive messages), and the identification of the target of the discriminatory content (xenophobia, racism, homophobia, sexism, religious intolerance, partyism, apology for the dictatorship, antisemitism, and fatphobia). Each comment was annotated by three different annotators, and high inter-annotator agreement was achieved. The newly proposed annotation approach is also language- and domain-independent (although it has been applied to Brazilian Portuguese).
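As an illustration of how agreement among three annotators on the binary layer could be measured, the sketch below computes Fleiss' kappa and a majority-vote label. This is a minimal sketch under stated assumptions: the abstract does not say which agreement measure was used, so Fleiss' kappa is only one common choice for a fixed number of annotators per item, and the function names and the toy labels are hypothetical, not taken from the corpus.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of items, each a list of labels (one per annotator).

    Assumption: every item is labeled by the same number of annotators
    (three in the setting described above).
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(item) != n_raters for item in ratings):
        raise ValueError("each item needs the same number (>= 2) of labels")

    totals = Counter()    # how often each category was assigned overall
    agreement_sum = 0.0   # sum of per-item agreement P_i
    for item in ratings:
        counts = Counter(item)
        totals.update(counts)
        pairs_agreeing = sum(c * c for c in counts.values()) - n_raters
        agreement_sum += pairs_agreeing / (n_raters * (n_raters - 1))

    p_bar = agreement_sum / n_items  # mean observed agreement
    # Expected agreement by chance, from the overall category proportions.
    p_e = sum((t / (n_items * n_raters)) ** 2 for t in totals.values())
    if p_e == 1.0:  # only one category was ever used
        return 1.0
    return (p_bar - p_e) / (1.0 - p_e)


def majority_label(labels):
    """Return the label chosen by a strict majority of annotators, else None."""
    label, count = Counter(labels).most_common(1)[0]
    return label if count > len(labels) / 2 else None


# Hypothetical toy data: three annotators, binary layer only.
toy = [
    ["offensive", "offensive", "offensive"],
    ["non-offensive", "non-offensive", "non-offensive"],
    ["offensive", "offensive", "non-offensive"],
]
kappa = fleiss_kappa(toy)
gold = [majority_label(item) for item in toy]
```

With three annotators and a binary label, a strict majority always exists; for the multi-class layers (offense level, target), `majority_label` can return `None`, and such cases would need a separate adjudication rule.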

