Paper title
SOLD: Sinhala Offensive Language Dataset
Authors
Abstract
The spread of offensive content online, such as hate speech and cyber-bullying, is a global phenomenon. This has sparked interest in the artificial intelligence (AI) and natural language processing (NLP) communities, motivating the development of various systems trained to detect potentially harmful content automatically. These systems require annotated datasets to train the machine learning (ML) models. However, with a few notable exceptions, most datasets on this topic cover English and a few other high-resource languages. As a result, research in offensive language identification has been limited to these languages. This paper addresses this gap by tackling offensive language identification in Sinhala, a low-resource Indo-Aryan language spoken by over 17 million people in Sri Lanka. We introduce the Sinhala Offensive Language Dataset (SOLD) and present multiple experiments on this dataset. SOLD is a manually annotated dataset containing 10,000 posts from Twitter, annotated as offensive or not offensive at both the sentence level and the token level, improving the explainability of the ML models. SOLD is the first large publicly available offensive language dataset compiled for Sinhala. We also introduce SemiSOLD, a larger dataset containing more than 145,000 Sinhala tweets, annotated following a semi-supervised approach.
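To make the two annotation levels concrete, the following is a minimal sketch of how a single post carrying both a sentence-level label and token-level labels might be represented. The class name `AnnotatedPost`, its field names, the label strings `"OFF"` and `"NOT"`, and the English placeholder tokens are all assumptions made for illustration; they are not taken from the released dataset.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AnnotatedPost:
    """Hypothetical record for one post; field names are illustrative only."""

    tokens: List[str]        # the post split into tokens
    token_labels: List[int]  # 1 if the token was marked offensive, else 0
    sentence_label: str      # assumed label strings: "OFF" or "NOT"

    def __post_init__(self) -> None:
        # Token-level annotation only makes sense with one label per token.
        if len(self.tokens) != len(self.token_labels):
            raise ValueError("tokens and token_labels must have the same length")

    def offensive_tokens(self) -> List[str]:
        """Return the tokens that annotators marked as offensive."""
        return [tok for tok, lab in zip(self.tokens, self.token_labels) if lab == 1]


# Placeholder English tokens stand in for a Sinhala tweet.
post = AnnotatedPost(
    tokens=["you", "are", "an", "idiot"],
    token_labels=[0, 0, 0, 1],
    sentence_label="OFF",
)
flagged = post.offensive_tokens()
```

In a layout like this, the token-level labels act as a rationale for the sentence-level decision, which is the kind of information that can be used to inspect or explain a classifier's prediction.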