NatCat: Weakly Supervised Text Classification with Naturally Annotated Resources

Paper Title

NatCat: Weakly Supervised Text Classification with Naturally Annotated Resources

Authors

Zewei Chu, Karl Stratos, Kevin Gimpel

Abstract

We describe NatCat, a large-scale resource for text classification constructed from three data sources: Wikipedia, Stack Exchange, and Reddit. NatCat consists of document-category pairs derived from manual curation that occurs naturally within online communities. To demonstrate its usefulness, we build general purpose text classifiers by training on NatCat and evaluate them on a suite of 11 text classification tasks (CatEval), reporting large improvements compared to prior work. We benchmark different modeling choices and resource combinations and show how tasks benefit from particular NatCat data sources.
