ETHOS: an Online Hate Speech Detection Dataset

Paper title

ETHOS: an Online Hate Speech Detection Dataset

Authors

Ioannis Mollas, Zoe Chrysopoulou, Stamatis Karlos, Grigorios Tsoumakas

Abstract

Online hate speech is a recent problem in our society that is rising at a steady pace by leveraging the vulnerabilities of the corresponding regimes that characterise most social media platforms. This phenomenon is primarily fostered by offensive comments, either during user interaction or in the form of a posted multimedia context. Nowadays, giant corporations own platforms where millions of users log in every day, and protection from exposure to similar phenomena appears to be necessary in order to comply with the corresponding legislation and maintain a high level of service quality. A robust and reliable system for detecting and preventing the uploading of relevant content will have a significant impact on our digitally interconnected society. Several aspects of our daily lives are undeniably linked to our social profiles, making us vulnerable to abusive behaviours. As a result, the lack of accurate hate speech detection mechanisms would severely degrade the overall user experience, although its erroneous operation would pose many ethical concerns. In this paper, we present 'ETHOS', a textual dataset with two variants: binary and multi-label, based on YouTube and Reddit comments validated using the Figure-Eight crowdsourcing platform. Furthermore, we present the annotation protocol used to create this dataset: an active sampling procedure for balancing our data in relation to the various aspects defined. Our key assumption is that, even gaining a small amount of labelled data from such a time-consuming process, we can guarantee hate speech occurrences in the examined material.
