SOLD: Sinhala Offensive Language Dataset

Abstract

The spread of offensive content online, such as hate speech and cyber-bullying, is a global phenomenon. This has sparked interest in the artificial intelligence (AI) and natural language processing (NLP) communities, motivating the development of various systems trained to detect potentially harmful content automatically. These systems require annotated datasets to train the machine learning (ML) models. However, with a few notable exceptions, most datasets on this topic have dealt with English and a few other high-resource languages. As a result, research in offensive language identification has been limited to these languages. This paper addresses this gap by tackling offensive language identification in Sinhala, a low-resource Indo-Aryan language spoken by over 17 million people in Sri Lanka. We introduce the Sinhala Offensive Language Dataset (SOLD) and present multiple experiments on this dataset. SOLD is a manually annotated dataset containing 10,000 posts from Twitter annotated as offensive and not offensive at both the sentence level and the token level, improving the explainability of the ML models. SOLD is the first large publicly available offensive language dataset compiled for Sinhala. We also introduce SemiSOLD, a larger dataset containing more than 145,000 Sinhala tweets, annotated following a semi-supervised approach.
