WAC: A Corpus of Wikipedia Conversations for Online Abuse Detection

Paper title

WAC: A Corpus of Wikipedia Conversations for Online Abuse Detection

Authors

Cecillon, Noé; Labatut, Vincent; Dufour, Richard; Linares, Georges

Abstract

With the spread of online social networks, it is increasingly difficult to monitor all user-generated content. Automating the moderation of inappropriate content exchanged on the Internet has thus become a priority task. Methods have been proposed for this purpose, but it can be challenging to find a suitable dataset to train and develop them. This issue is especially acute for approaches based on information derived from the structure and dynamics of the conversation. In this work, we propose an original framework, based on the Wikipedia Comment corpus, with comment-level abuse annotations of different types. The major contribution concerns the reconstruction of conversations, in contrast to existing corpora, which focus only on isolated messages (i.e. taken out of their conversational context). This large corpus of more than 380k annotated messages opens perspectives for online abuse detection, especially for context-based approaches. In addition to this corpus, we also propose a complete benchmarking platform to stimulate and fairly compare scientific works on the problem of content abuse detection, trying to avoid the recurring problem of result replication. Finally, we apply two classification methods to our dataset to demonstrate its potential.
