Semi-Supervised Cleansing of Web Argument Corpora

Paper Title

Semi-Supervised Cleansing of Web Argument Corpora

Authors

Jonas Dorsch, Henning Wachsmuth

Abstract

Debate portals and similar web platforms constitute one of the main text sources in computational argumentation research and its applications. While the corpora built upon these sources are rich in argumentatively relevant content and structure, they also include text that is irrelevant, or even detrimental, to their purpose. In this paper, we present a precision-oriented approach to detecting such irrelevant text in a semi-supervised way. Given a few seed examples, the approach automatically learns basic lexical patterns of relevance and irrelevance and then incrementally bootstraps new patterns from sentences matching the patterns. In the existing args.me corpus with 400k argumentative texts, our approach detects almost 87k irrelevant sentences, at a precision of 0.97 according to manual evaluation. With low effort, the approach can be adapted to other web argument corpora, providing a generic way to improve corpus quality.
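
The abstract does not say how the lexical patterns are defined, so the following is only a minimal sketch of the general idea of seed-based pattern bootstrapping, not the authors' implementation. It assumes that a pattern is a word trigram, that a trigram counts as a pattern once it occurs in at least two irrelevant sentences, and that any trigram also seen in a relevant seed is discarded to favor precision. The function names, thresholds, and example sentences are all hypothetical.

```python
from collections import Counter


def ngrams(sentence, n=3):
    """Return the set of lowercase word n-grams in a sentence."""
    tokens = sentence.lower().split()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def learn_patterns(irrelevant, relevant, n=3, min_count=2):
    """Keep n-grams found in >= min_count irrelevant sentences and in no relevant one."""
    counts = Counter(g for s in irrelevant for g in ngrams(s, n))
    blocked = {g for s in relevant for g in ngrams(s, n)}
    return {g for g, c in counts.items() if c >= min_count and g not in blocked}


def bootstrap(seed_irrelevant, seed_relevant, unlabeled, n=3, min_count=2, max_rounds=5):
    """Alternate between learning patterns and matching sentences until nothing new appears."""
    irrelevant = set(seed_irrelevant)
    patterns = set()
    for _ in range(max_rounds):
        new_patterns = learn_patterns(irrelevant, seed_relevant, n, min_count)
        if new_patterns <= patterns:  # no new pattern learned: stop
            break
        patterns |= new_patterns
        # Sentences matching any pattern become evidence for the next round.
        irrelevant |= {s for s in unlabeled if ngrams(s, n) & patterns}
    return patterns, irrelevant - set(seed_irrelevant)


# Hypothetical toy data (assumption: not taken from args.me).
seed_irrelevant = ["thank you for the debate", "i thank you for the great round"]
seed_relevant = ["the resolution for the debate is flawed"]
unlabeled = [
    "thank you for reading and please vote for pro",
    "thank you for the debate so please vote for pro",
    "as always please vote for pro",
    "taxes should fund public schools",
]

patterns, detected = bootstrap(seed_irrelevant, seed_relevant, unlabeled)
for sentence in sorted(detected):
    print(sentence)
```

In this toy run, the third unlabeled sentence shares no trigram with the seeds and is only reached through a pattern learned from the first two matches, which is the bootstrapping step. The trigram "for the debate" is never used as a pattern because it also appears in the relevant seed.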
