Paper title
NatCat: Weakly Supervised Text Classification with Naturally Annotated Resources
Authors
Abstract
We describe NatCat, a large-scale resource for text classification constructed from three data sources: Wikipedia, Stack Exchange, and Reddit. NatCat consists of document-category pairs derived from manual curation that occurs naturally within online communities. To demonstrate its usefulness, we build general-purpose text classifiers by training on NatCat and evaluate them on a suite of 11 text classification tasks (CatEval), reporting large improvements compared to prior work. We benchmark different modeling choices and resource combinations and show how tasks benefit from particular NatCat data sources.